Fisher Information in Flow Size Distribution 

Estimation 

Paul Tune, Member, IEEE, and Darryl Veitch, Fellow, IEEE 



Abstract — The flow size distribution is a useful metric for traf- 
fic modeling and management. Its estimation based on sampled 
data, however, is problematic. Previous worli has shown that flow 
sampling (FS) offers enormous statistical benefits over packet 
sampling but high resource requirements precludes its use in 
routers. We present Dual Sampling (DS), a two-parameter family, 
which, to a large extent, provide FS-like statistical performance 
by approaching FS continuously, with just packet-sampling- 
like computational cost. Our work utilizes a Fisher information 
based approach recently used to evaluate a number of sampling 
schemes, excluding FS, for TCP flows. We revise and extend 
the approach to make rigorous and fair comparisons between 
FS, DS and others. We show how DS significantly outperforms 
other packet based methods, including Sample and Hold, the 
closest packet sampling-based competitor to FS. We describe a 
packet sampling-based implementation of DS and analyze its 
key computational costs to show that router implementation 
is feasible. Our approach offers insights into numerous issues, 
including the notion of 'flow quality' for understanding the 
relative performance of methods, and how and when employing 
sequence numbers is beneficial. Our work is theoretical with some 
simulation support and case studies on Internet data. 

Index Terms — Fisher information, flow size distribution, Inter- 
net measurement, router measurement, sampling. 

I. Introduction 

The distribution of flow size, that is the number of packets in 
a flow, is a useful metric for traffic modelling and management, 
and is important for security because of the role small flows 
play in attacks. As is now well known however, its estimation 
based on sampled data is problematic. 

Currently, sampling decisions in routers are made on a per- 
packet basis, with only sampled packets being subsequently 
assembled into (sampled) flows. Duffield et al. [I] were the 
first to point out that simple packet sampling strategies such 
as '1 in N' periodic or i.i.d. (independent, identically dis- 
tributed) packet sampling have severe limitations, in particular 
a strong flow length bias which allows the tail of the flow 
size distribution to be recovered, but dramatically obscures 
the details of small flows. They explored the use of TCP SYN 
packets to improve the resolution at the small flow end of 
the spectrum. Hohn et al. ||2l, ||3] explored these difficulties 
further and pointed out that ^ow sampling, where the sampling 
decision is made directly on flows, resulting in all packets 
belonging to any sampled flows being collected, has enormous 
statistical advantages. However, flow sampling has not been 
pursued further nor found its way into routers, partly because 
it implies that lookups be performed on every packet, which 
is very resource intensive. 
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More recently, Ribeiro et al. explored the use of TCP 
sequence numbers to improve estimation for TCP flows. The 
idea is that the presence of packets which are not physically 
sampled can be inferred by noting the increasing byte count 
given by the sequence number fields of sampled packets. By 
using the Fisher information as a metric of the effectiveness of 
sampling in retaining information about the original flow sizes, 
they showed that this helps greatly to 'fill in the holes' left by 
packet sampling. However, they did not address whether these 
techniques out-perform flow sampling (FS). 

In this paper we revisit FS in the context of TCP flows. 
Our first contribution is to explain how the approach of 
m can be reformulated and extended to include FS. This 
provides a framework for our second contribution, proofs that 
FS outperforms existing methods by a large margin, though 
they themselves greatly improve upon simple packet sampling. 
Our results are rigorous, based on explicit calculation and 
comparison of the Fisher Information matrices of competing 
schemes. With the statistical reputation of FS thus reinforced, 
the challenge is to find methods which can somehow approach 
or approximate flow sampling in order to benefit from its 
information theoretic efficiency, but with lower resources 
requirements. We show how this can be done. 

The computational problem for FS can be described as 
follows. To capture the variety present in traffic flows and 
to provide the raw material for a variety of current (and 
future) metrics, many flows must be sampled. This implies 
large flow tables which in turn implies the use of slower but 
cheaper DRAM rather than the faster but expensive SRAM |]5l- 
However, DRAM is not fast enough to perform lookups for 
every packet, as required by a straightforward implementation 
of FS, for today's high capacity links. The question then 
becomes, how can flow sampling be implemented using per- 
packet decisions, in other words using some form of packet 
sampling? 

The main contribution of this paper is the introduction 
of Dual Sampling (DS), a hybrid approach combining the 
advantages of both packet and flow sampling. It is a two 
parameter sampling family which includes FS as a special 
case and allows FS to be approached continuously, enabling 
a tradeoff of sampling efficiency against computational cost. 
Computationally, it can be implemented via a modified form 
of two-speed or 'dual' packet sampling which circumvents 
the problem of slow DRAM. There is a cost in terms of 
wasted samples, but we show that this can be borne in high 
speed routers. Following DS benefits from the use of 
TCP sequence numbers although it can also be used without 
them, and we provide insight into how and when they have an 
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impact. We show rigorously that DS outperforms the methods 
proposed in H- We also compare and contrast DS with the 
well known 'Sample and Hold' scheme |6j. We show that 
Sample and Hold performs quite well, though not as well as 
DS. 

Finally, we introduce SYN+SEQ+FIN, another sampling 
method which enables flow sampling to be perfectly achieved 
(aside from errors in the mapping of byte to packet counts) at 
very low computational cost, well below that even of packet 
sampling. Its disadvantage is that it exploits the TCP FIN field, 
when not all TCP flows terminate correctly with a FIN packet. 

With its explicit use of TCP protocol information in most 
cases, our work applies to TCP flows only. However, the 
ideas and results could apply to other kinds of flows provided 
that suitable substitutes could be found for connection startup 
(SYN), 'progress' (sequence numbers) and termination (FIN). 
TCP flows still constitute the overwhelming majority of traffic 
in the Internet. 

The rest of the paper is organized as follows. Section |ll] 
describes our sampling framework and derives the Fisher 
information matrix and its inverse explicitly. Section |lll] de- 
fines the sampling methods and derives their main properties. 
Section |IV] compares the methods theoretically and derives 
further properties explaining their performance, and Section FV] 
explores in more detail how sequence numbers reduce esti- 
mation variance. Section |VI] introduces a simple model for 
computational cost and uses it to define and solve an opti- 
mization problem for sampling performance under constraints. 
Section lVTIl applies the methods to real Internet data and shows 
that DS performs favorably with flow sampling in practice, and 
has better performance than Sample and Hold. A closed-form 
unbiased estimator was proposed for DS and Sample and Hold 
which achieves the Cramer-Rao lower bound asymptotically, 
eliminating the need for iterative optimization algorithms. We 
conclude and discuss future work in Section IVIIII 

This paper is an extended and enhanced version of the con- 
ference paper l?). The main additions relate to the inclusion 
of the Sample and Hold method throughout the paper, several 
new theorems and counter-examples on method comparison, 
and the inclusion of a new major data set. 

II. The Sampling Framework 

In this section we establish a framework to define and 
analyze sampling techniques applied to an idealized view of 
TCP flows on a link. Nominally, we imagine that such flows 
are defined by the usual 5-tuple of origin and destination IP 
addresses, port numbers, and TCP protocol field together with 
a timeout. For the analysis we make a number of simplifying 
assumptions; 

(i) flows begins with a SYN packet and have no others, 

(ii) flows are not split (this can occur through timeouts or 
flow table clearing), 

(iii) all necessary protocol information (5-tuple, SYN/FIN 
bits and sequence numbers) can be observed, and 

(iv) per-flow sequence numbers count packets, not bytes. 

Assumptions (iii) and (iv) will be discussed/relaxed when we 
deal with real data in Section IVIII Note that we do respect 



TCP's per-flow random initialization of sequence numbers. 
Hence their absolute value holds no information on the number 
of packets in a flow, only differences of sequence numbers 
matter This is crucial for the analysis. 

A. The Flow Model 

We consider a measurement interval containing Nf flows. 
Let nii denote the size of flow i (the number of packets it 
contains). It satisfies 1 < nii < W, where 1 < < co 
is the maximum flow size. The total number of packets is 

n = l^i^i mj. 

Let Mj be the number of flows of size j, 1 < j < W, 
that is M, = E.:™,=, 1, and Nj = M,. The flow 

size 'distribution' is the set = {di, 02, ■ ■ ■ , Ow} of relative 
frequencies, that is 
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where < 61^ < 1 and J^f^i = 1- Note that {M^} and 
{0, Nf} are equivalent and complete descriptions of the flows 
in the measurement interval. They are sets of deterministic 
parameters, not random variables, effectively a deterministic 
flow size model. 

Most of the literature on traffic sampling follows the 
above viewpoint, where the data is deterministic, the only 
randomness being introduced through the sampling itself. An 
exception is the work of Hohn and Veitch (ID, jS)) where 
randomness arises both through the traffic model and the 
sampling process, which makes the analysis considerably more 
difficult, but less generic. 

B. General Random Sampling 

The effect on flow size of random sampling can be described 
as follows; from an original flow of size k, only j packets, 
< j 1^ k, are sampled (retained), with probability 



Pr(sampled flow has j pkts | original flow has k pkts). 



The operation of the sampling scheme is entirely defined by 
the bjh, which can be assembled into a + x W sampling 
matrix B, whose + k)-th element is bjk- Note that bjk = 
for j > k. By definition B is a (column) stochastic matrix, 
that is each element obeys bjk > 0, and each column sums to 
unity. 

The experimental outcome can be described by a set of 
random variables {Mj \ < j < W} where Afj counts the 
number of sampled flows of size j. Thus, 

i-.m'^—j i—1 

Equivalendy, let 6' = {9'j | < j < W} (note that the index 
i includes 0) denote the empirical distribution of sampled flow 
sizes, where 



(2) 
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where < 6lj < 1 and Yl^^o = 1- Note that 6*^) > as 
some flows may not survive the thinning process. Out of the 
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original Nf flows, only N} ^ Nf{l - 9'„) = Nf - A/^ flows 
survive sampling. 

Define a normalized set of fractions of the sampled flow 

sizes 7 = {7j | 1 < j < W} by 

„ _ 

where < 7j < 1 and X^JLi 7j = 1- The set 7 constitutes 
the naive or directly measurable sampled flow size distribution 
and is equivalent to the distribution of flows conditional on at 
least a single packet from that flow being sampled. For j > 1, 
e'j is related to 7^- by (1 - d'a)-fj ^ (Nj/Nf)-/, 9'j. 

C. The Unconditional Formulation 

The sampled flow above includes the case, j — 0, where 
the flow 'evaporates'. It seems natural to conclude however 
that such cases cannot be observed. This logically leads to 
an analysis based on the observation of the 7^ defined above 
where j > 1, which is effectively conditional: sample flow 
distributions given that at least one packet is sampled. This 
is the approach adopted in U, lH, lH and in the literature 
generally. One of the key differences in our work is that we 
show that it is possible to observe the j = case, leading to 
an unconditional formulation which enjoys many advantages. 

To see how this is possible we return to general context of 
Nf flows, each one of which will be sampled in this general 
sense. As defined above Nj- is the number of flows of size at 
least 1 after sampling. The number of evaporated flows is just 
Nf — Np but typically Nf is not known and is regarded as 
a 'nuisance parameter' which must be estimated. However, it 
can easily be measured by directly counting the total number 
of SYN packets, which is just the number of flows Nf. For 
methods which are akeady assuming an ability to access and 
perform specific actions based on whether a packet is a SYN 
or not, the additional assumption of being able to count all 
SYN packets is a natural one. It is also implementable, as 
a single additional counter which checks every packet and 
conditionally increments based on a small number of header 
bits is not difficult even at the highest speeds Q, as we discuss 
in more detail in Section |Vll In summary, by knowing Nf, 
every flow gives rise to a sampled flow, each one of which 
is observable, either directly (j > 1), or indirectly (j = 0). 
In other words, we can in effect observe the 9j over the full 
range from j = to j = W. 

The chief advantage of the unconditional formulation is the 
very simple form of the likelihood function for the experimen- 
tal outcome j for a single flow. This makes the manipulation 
of the Fisher information far more tractable, leading to new 
analytic results and insights. The other big advantage is that 
flow sampling can now be included. In the conditional world 
flow sampling is perfect - by definition, if a flow is sampled 
at all, all its packets will be and so there is nothing to do! 
The unconditional framework allows the missing part of the 
picture to be included, enabling meaningful comparison. 

D. The Sampled Flow Distribution 

Our analysis is based on the idea of selecting a 'typical' 
flow, and that flows are mutually independent (a reasonable 
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Fig. 1. The flow sampling and selection process. Here a flow is selected 
which had fc = 5 packets originally and J = 3 after sampling. 

assumption if Nf is very large). Since flows are in fact 
deterministic, this is only meaningful if we introduce a sup- 
plementary random variable U, a uniform over the Nf flows 
available, which performs the random flow selection. This 
variable, which acts 'invisibly' behind the scenes (and is rarely 
discussed), is not part of the random sampling scheme itself, 
but is essential as it allows the 6 parameters to be treated as 
probabilities, even though they are not. An example is given 
in Figure [T] which shows Nf = 12 flows before and after 
sampling, followed by a random flow selection. In the interests 
of clarity a flow of size j = i after sampling was selected, 
but it could have been one of the evaporated flows (j ~ 0). 

With this background established, the discrete distribution 
for a sampled flow originally of size k is very simple; 

w 

dj^Y.^3k0k, Q<3<W. (4) 
fc=i 

This can be expressed in matrix notation as 

d = (5) 

where d = [do, c^i, ^2, • • ■ , '^w]'^ is a (W + 1) x 1 column 
vector, and a VF x 1 column vector. The probability dj is 
related to the empirical fraction 0'^ for j > by 

^ Nf Nf 

Nf Nf 
The likelihood function for the parameters is simply 

fij;e)^d„ 0<j<W. (6) 

In the conditional framework commonly used j = is missing, 
and normalization is then needed to ensure probabilities add 
to one. This implies a division of random variables, which 
greatly complicates the likelihood. 

E. The Fisher Information of a Sampled Flow 

The parameter vector is the unknown we would like to 
estimate from sampled flows. Since here we are not concerned 
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with specific estimators of 6, but in the effectiveness of the 
underlying sampling scheme, a powerful approach (introduced 
in lIU) is to use the Fisher information |0 Section 11.10] to 
access its efficiency in collecting information about 9. 

We first introduce notation that will be used throughout this 
paper. The expectation of a random variable X is denoted by 
E[X], and the variance by Var(X). Matrices are written in 
bold-face upper case and vectors in bold-face lower case. The 
transpose of a matrix A is denoted by A^. The operator also 
applies to vectors. The operator tr(A) denotes the trace of the 
matrix A. The matrix I„ denotes the n x n identity matrix. 
The vector 1„ = [1, 1, . . . , 1]""^ denotes an n x 1 vector of Is. 
The vector 0„ denotes the nxl null vector, and the mxn null 
matrix is written as 0,„x„. Given an n x 1 vector x, diag(x) 
denotes an n x n matrix with diagonal entries xi,X2, ■ . ■ ,Xn. 

Definition 1: An n x n real matrix M is positive definite 
iff for all vectors z e R"\{On}, z'^Mz > 0, and is positive 
semidefinite iff z^Mz > 0. 

We write A > or < A to indicate that A is positive 
definite. For two matrices A and B, we write A > B to 
mean A — B > in the positive definite sense. Similarly, 
A > B and A — B > each mean that A — B is positive 
semidefinite. The operator | • | returns the size of a vector or 
set. All other definitions will be defined when needed. 

The Fisher information is useful because its inverse is the 
Cramer-Rao lower bound (CRLB), which lower-bounds the 
variance of any unbiased estimator of 9. In fact the Fisher 
information takes a different form depending on whether 
constraints are imposed on the 9 or not ifTOll . Inequality 
constraints are particularly problematic, so we avoid them by 
assuming that each 6k obeys < 6'^ < 1 (this ensures that 
the CRLB optimal solution cannot include boundary values, 
which would create bias and thereby invalidate the use of 
the unbiased CRLB). Assuming that flows exist for all sizes, 
i.e. that 6k > Q for all k, is reasonable given the huge number 
of simultaneously active flows (up to a million) in high end 
routers. There is one more constraint, the equality constraint 
= 1, which must be included. As this complicates 
the Fisher information, we first deal with the unconstrained 
case. 



F. The Unconstrained Fisher Information 

The Fisher information is based on the likelihood and is 
defined by 

3{9) = E[(Velog/(j;0))(Velog/(j;0))T] 
w 

= Y,{Ve\ogf{j-e))iye\ogJ{3-9)fd,. (7) 

Here \7 e log f {j ; 9) = {l/dj)[bji, . . . ,bjw]'^ because of the 
simple form ^ of the likelihood. This leads to the simple 
explicit expression {J{9))ik = Sj!lo '^T^' equivalently 

3{9) = B^B{9)B (8) 
where D(0) is a diagonal matrix with {'D{9))jj = dj}-^^. 



We will need to find the inverse of J, but since B is not 
square, this cannot be done directly from ([8]) in terms of the 
inverse of B. However if we re-express B as 



B 



B 



(9) 



where bj = [6oi, . . . , bow] is the top row of B and 



B 



then we can write 



bn 




bi2 

b22 







3(9) 
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do 



bobj 



biw 

b2W 

bww 



~3{9) 



.,d 



w ^ 



(10) 
and 



where 3(9) = B^r>{9)B, D(0) = diag(d-\ 
do = bg 0. Since B and D(0) are square, the inverse of 3{9) 
is just J-^^) = B-iD-i(0)(B-i)^, that is 



w 



(11) 



- (B"i)jfe. By Lemma[28];ii) in AppendixE 3{9) 



where b'^ 

is positive definite. We can now give the inverse of J. 
Proposition 2: The inverse of 3 (9) is given by 

Proof: The matrix inversion lemma applies (see 
Lemma |30] in Appendix |A]i with R = J and T = 1 /do 
nonsingular Since do + bjj^^(0)bo > bjj^^(0)bo > 
as is positive definite, the result immediately follows. ■ 
The diagonal elements of the matrix 3~^{9) will be important 
in later sections, and the explicit formula is given below: 



EfeL, dkb'jk ELi ^oeb'ik 



do + J2kLi dk Efci boeb'gi 



2 ■ 



{3-\9)),,=Y,b%dk- 

k=l _ .. 

'"'(12) 

Again, this explicit inverse, valid for any general sampling 
matrix B, is made possible by the very simple form of the like- 
lihood function in equation We now specialize the above 
result for sampling matrices that satisfy particular conditions. 
Although some of these matrices exhibit a dependence on 9, 
we drop this dependence for notational simplicity when the 
context is clear 

Corollary 3: If for some constant q the sampling matrix B 
satisfies bo = qlw then (setting p = 1 — q) 

J 1=J 



Proof: The given condition implies that do 



hl9 



qlw^^ = 9' l^B ^ = {l/p)l^- since B is column 
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Stochastic. Next, 

w 

Let d = = [di,d2, ■ ■ ■ dwV' ■ Then from Proposition |2l 
1 



J 1 = J 1 
= J 



,-1 



J"ibobJj-i 



B-1D-11h.1^D-1(BT)-i 



9 + g^/p 



B-iddT(BT)-i 



9 + 

p 

■ 

The matrix J corresponds to the information carried by the 
outcomes 1 < j < W only. We expect J to carry more 
information through the knowledge of Nf which gives access 
to j = 0, and therefore J^^ to have reduced uncertainty, 
corresponding (through the CRLB) to a reduced variance. The 
following result confirms this intuition (proof in Appendix [All. 

Theorem 4: An upper bound for J^^(0) is 

Equality holds if and only if bo = Oh^xi- 

The reduction in uncertainty is given by the second term in 

the expression for J~^(0) in Proposition |2] 

G. The Constrained Fisher Information and CRLB 

Intuitively, constraints on the parameters should increase the 
Fisher information since they tell us something more about 
them, 'for free'. In fact, |im shows that this is only true for 
equality constraints. Since we are assuming that Q < 6k < 1, 
the only active constraint is Y^=i^k = 1- Its gradient is 

G{e) = Veg{0) (13) 



where g{e) = J2Y=i &j 



1. 



The inverse constrained Fisher information ifTTI is 

1+ =3-^ -3-^G{G^3-^Gy^ G^i-^ (14) 

where denotes the Moore-Penrose pseudo-inverse lfT2l 
Chapter 20, pp. 493-514] of the constrained Fisher information 
matrix I. The matrix is rank W — 1 due to the single 
equality constraint and is thus singular (see lITT] Remark 2]). 
This somewhat formidable expression can be simplified in our 
case, as we now show. 

Lemma 5: J diag(0i, . . . , = Iw- 

Proof: The row sum of row i of J diag(6'i, . . . , 9w) is 

WW,, w , w w 

0, 



k=lj=0 j=0 ■' k=l 

since B is column stochastic. 



It is easy to see that G = \w Hence J^^G = 3^^1\y = 
diag(^^i, . . . , 9w)'i-w = d from Lemma |5] It is then straight- 
forward to verify that (G^J^^G) is simply the number 1, 
and further that ( fT4b can be reduced to 



(15) 



The remarkable thing here is that the constraint term 69 
depends on 6 only, and so is constant for all sampling matrices 
B, a great advantage when comparing different methods. 

Since we are assuming flows are sampled independently, 
the Fisher information for N flows is just A^J, and the inverse 
becomes X'^ /N. For any unbiased estimator of 0, the CRLB 
then states that 



E[(e-e){e-ey 



> 



N 



(16) 



Because of independence we study = 1. In practice all 
flows are sampled and so N = Nf. 

Remark 6: There is an interpretation to the simple struc- 
ture of the matrix P = J-iG(GTj-ig)-igTJ-i. The 
scalar value I^.X^Ivk is equivalent to Var(^,^j 9i). By 
the equality constraint, we expect the best estimator of 
X^il^i iic^ve a variance of 0, since the estimator already 
knows that = 1- This corresponds to a CRLB of 

zero, namely l^X+lw = l^J^^liv - l^-OO^lw ~ 
l'^diag{0i,..., ew)lw -1 = 0. Thus, P = 69^ is the 
form of the correction term to the unconstrained covariance 
matrix J^^ needed to satisfy the constraint. 



III. The Sampling Methods 

In this section we define the sampling methods we consider 
and derive their main properties. We begin with methods 
which have been described elsewhere, including simple packet 
and flow sampling, as well as others exploiting protocol 
information, in particular those proposed in ||4|, |[T]. Apart 
from their inherent interest, we revisit these because in the 
unconditional framework these methods are now all different 
to before. More importantly, we also derive inverses analyti- 
cally which has not been possible before, and thereby obtain 
a number of important insights. We also include the widely 
cited 'Sample and Hold' |6| whose Fisher information has not 
previously been studied. We then introduce our new method. 
Dual Sampling. 

To better see the connection between the usual framework 
and ours, recall that bjk is always a conditional probability 
with respect to the size k of the original flow. Typically 
however, it is also made conditional with respect to j, but we 
do not so here. Hence, if Be is the usual j -conditional matrix, 
then BcC = B where C = 1^/ — diag(&oi, ■ • • , bow), i-S- the 
matrix C^^ does the conditioning. 

We use the decomposition of (|9]l to describe each sampling 
matrix B. In each case we define B and B, give the inverse 
B~^ of B, and give explicit expressions for the diagonal terms 



(J- 



or in some case for the entire inverse J 



The 



importance of the diagonal terms will become very clear in 
Section HV] 
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A. Packet Sampling (PS) 

By this we mean the simplest form of sampling, i.i.d. packet 
sampling, where each packet is retained independently with 
probability pp and otherwise dropped with qp = l~Pp- For the 
purpose of simplicity, we treat both '1 in A^' periodic sampling 
and i.i.d. random sampling under the same framework, as both 
methods were shown to be statistically indistinguishable in 
practice [[T]. 

The chief benefit of PS is its simplicity, and the fact that it 
can be implemented at high speed because a sampling decision 
can be made without even inspecting the packet. The chief 
disadvantage is the fact that it has a strong length bias, small 
flows are very likely to evaporate. 

It is easy to see that bjk = I . )pj,qp~'', or 
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Before finding the inverse of B, it is first instructive to note 
the following general results by Strum ifTSl . Let B{x, y) be an 
[W + 1) X {W + 1) matrix with the following structure 



B{x,y) = 



X 



X 







W 

w 







,w 



which is known as the binomial matrix. Note that S(0, 1) 
reduces to the identity matrix. From ifTSl we have 

Lemma 7: If y ^ 0, then B{x, y) is invertible and 

[B{x,y)]-^ ^B{~xy 



)• 



Using these results we can find the inverse of B (recall that 

= (B-i),fc.) 

Theorem 8: The inverse of B is given by 



B 



Pp 




1 -2p-\ Zp-^ql ••• Wp-^^^-qp^-^- 
Pv^ -^Pp^Qp ■■■ '■ 



-w 



that is h'- 



i-^f-'Q^-'p-p 
= [qp, 

B{qp,Pp) 



Proof: Here bj = [^p, g^, ... q^]. We have 



1 
Ow 



B 



Since B{qp,Pp)[B{qp,Pp)] ^ = Iiv+i, for some k we have 







■ 1 k ■ 






Ow B 




Ow B \ 




Ow iw_ 



Furthermore, from Lemma |7] we have [B{qp,Pp)]^^ = 
B{—qpPp^,Pp^). Thus B~^ is essentially a principal W xW 
submatrix of B{—qpPp^,Pp^). ■ 
For PS J^^ is difficult to write in a compact form and will 
be omitted. It is however feasible to give using equation ( fT2b 



w 

k=j 



EW ni, -2k 7 
fc=o qp Pp dk 

The above form is derived in Appendix ICl 
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B. Packet Sampling with Sequence Numbers (PS+SEQ) 

First PS with parameter pp is performed as above. Se- 
quence numbers are then used as follows. Let s; be the 
lowest sequence number among the sampled packets, and s/j 
the highest. All packets in-between these can now reliably 
inferred, hence j = s/i — s/ + 1 is the number of sampled 
packets returned. This is called "ALL-seq-sflag" in ||4|- 

The chief benefit of PSh-SEQ is the fact that a potentially 
large number of packets can be 'virtually' observed without 
having to physically sample them. The disadvantage is the 
additional processing involved to perform the inference. Also, 
the technique is of limited value if flows are too short (as we 
discuss later). 

If j = 0,1 then sequence numbers cannot help and bjk is 
as for PS. Otherwise, note that the j packets must occur in a 
contiguous block bordered by si and Sh- There are k — j + 1 
possible positions for such a block, each characterized by k—j 
unsampled packets outside it and the borders s; and Sh- It 
follows that bii, ^ (k — j + l)p'iq^^^ for 2 < j < k. Hence 
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Theorem 9: The inverse of B is 
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is a W X W matrix and 
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A straightforward computation yields 
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and S 1 = diag(pp\pp2 Thus, B ^ 

rp-ig-i^ which proves our resuh. 

The diagonal elements of J^^ are given by 



(J ^)ii = Pp ^di + Aq^pp + qpPp ^d^ 
(J- 



)22 



Pp d2 



MpPp da 



IpPp di 



{3-^) J J = Pp^d, + mIp;%+i + qtPp%+2, S<j<W 

where r — do + qpPp^di + qpPp^d2, and for convenience we 
set dj = for j > W. 



C. Packet Sampling with SYN Sampling (PS+SYN) 

First PS with parameter pp is performed as above. A 
post-processing phase then discards all packets belonging to 
sampled flows which lack a SYN packet (or more accurately, 
maps them to sampled flows with j — 0). This was introduced 
in m] and called "SYN-pktct" in E). 

The chief benefit of PS+SYN is fliat the flow length bias 
of PS is averted by keeping flows based on the presence 
of the SYN, which is flow length independent. The chief 
disadvantage is the fact that it is wasteful: if pp = 0.01 then 
99% of packets which were initially sampled belong to 'failed' 
flows and are subsequently discarded! 

A flow evaporates iff its SYN is not sampled, hence 
^ofc = Qp- For j > 1 the SYN must first be sampled, which 
occurs with probability pp, and conditional on this j — 1 more 
packets must be sampled from the remaining fc — 1 using i.i.d 

/fc — 1 

sampling. Hence bjk — Pp 



p>p ^ql ^ for j > 1, 
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Theorem 10: The inverse of B is given by 
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Proof: Note that we can express B in terms of 
a W X W binomial matrix B{x, y) such that B = 
PpB{qp,Pp). Then by Lemma |71 the inverse is given by 
B = {l/pp)B{-qpp-\pp'). m 

It is easy to see that the condition of Corollary [3] is satisfied 

and the diagonal 



with p = Pp. Hence J^^ = J^^ - ^00 

pp 

entries for 1 < j < W aie 

w 
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2'=4 - :«i 



j ■ 
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D. Packet Sampling with SYN and SEQ (PS+SYN+SEQ) 

First sampling is performed according to PS+SYN with 
parameter pp, and then on each resulting sampled flow the se- 
quence number post-processing is performed as per PS+SEQ. 
This is cafled "SYN-seq" in |4|. PS+SYN+SEQ is a hybrid 
" of PS+SYN and PS+SEQ and combines the advantages and 
disadvantages of both. 

If j = 0, 1 then sequence numbers cannot help and bjk is as 
for PS+SYN. Otherwise, by combining the arguments above, 
it is easy to see that bjk = Pp ■ Ppqp~^ for j > 1, giving 
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Theorem 11: The inverse of B is given by 



B-^1 

Pp 



Proof: A straightforward computation shows that 
BB-i=Ih.. ■ 
Since bo = qplw, Corollary[3]applies and states that J^^ = 
— ^66^ . The diagonal elements can be explicitly written, 
but we defer this to Section IIII-HI 



E. Flow Sampling (FS) 

In i.i.d. flow sampling |[3], flows are retained independently 
with probability pj and otherwise dropped with qf = 1 — pf. 

The chief benefit of FS is the fact that flows which are 
sampled retain their full complement of packets, eliminating 
completely the difficulties in inverting sampled flow sizes back 
to original sizes. The chief disadvantage is that each packet 
requires a lookup in a flow table to see if it belongs to be flow 
which has been sampled. 

A flow evaporates iff its SYN is not sampled, hence bok = 
qj. If a flow has been selected, which occurs with probability 
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Pf, then conditional on this j 
1 if j = fc, else 0, for j > 1: 



k with certainty, that is bj^ 
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The inverse of B is just B ^ = Iw/pf, and J takes the 
elegant form J(0) = qjlwlw + P/diag(6'f \ . . . , 0^). 
Clearly Corollary |3] applies with p — pf. Since 



'^W )' 



J = pfdiag{9^ , . 
the unconstrained inverse can therefore be expressed as 



3-\9) = — diag(0) + (1 - —)ee^. 

Pf Pf 



Remark 12: By using equations (fTSl l and jTH . the inverse 
constrained Fisher information matrix for FS is given by 



1 



1 



x+ = — diag(6») - —ee ^, 

Pf Pf 

with the diagonals being {X'^)kk ~ ^ Sk). This is 

just an appropriately scaled version (by l/pf) of the inverse 
Fisher information of a multinomial model. This makes sense 
given that FS works by picking out whole flows from the orig- 
inal flow set in an i.i.d fashion and that the complete likelihood 
function can be modeled using a multinomial model. 

F. Packet Sampling with SYN, FIN,SEQ (PS+SYN+FIN+SEQ) 

In this scheme SYN packets are retained independently with 
probability pf and otherwise dropped with qj = 1 ~Pf, and the 
FIN packets corresponding to sampled SYNs are also sampled, 
but no others. Sequence numbers are then used to infer flow 
sizes. 

This scheme has two great advantages: like FS the flows 
sampled are sampled perfectly, and moreover this could be 
achieved by physically sampling only two packets per flow, 
based on looking for SYN and FIN flags on a per packet 
basis, which is feasible at high speed. The disadvantage is that 
a moderate minority of flows do not terminate correctly with 
a FIN, and/or the FIN may be not observable. Furthermore, 
flows consisting of a single SYN (such as in a SYN attack) 
would be entirely missed. For this reason we choose not to 
study it further. 

Information theoretically, PS+SYN+FIN+SEQ is identical 
to flow sampling provided we assume di w 0. 

G. Sample and Hold ( SH) 

Here packets are first sampled as for PS with probability 
Pp, however for each flow if a packet is sampled, then all 
subsequent packets in the flow will be. Hence the total number 
of packets sampled is much higher than the parameter pp. The 
scheme was introduced in j?). 

The chief benefit of SH is that provided just a single packet 
from a flow is PS-sampled, then typically many will be finally 



sampled. This conditional behaviour is much more effective 
than methods using SEQ where at least two packets must be 
PS-sampled, and even then fewer packets are finally recouped. 
Essentially SH skips a geometric number of the first packets 
in a flow and then captures all the rest. It therefore efficiently 
skips small flows and accurately recovers the size of large 
flows. This amplified flow length bias (even stronger than for 
PS) makes it well suited for the heavy hitter problem (i.e. ac- 
curately measuring the very largest flows) for which it was 
originally designed. For flow size estimation more generally 
however, it is a disadvantage for most flow sizes. The other 
disadvantage is the need to check, for each packet, whether it 
belongs to a flow which has already been sampled. This makes 
it very costly in a true sampling implementation. Indeed Estan 
and Varghese implemented it using lossy sketching techniques 



(21) A flow evaporates iff none of its packets are sampled, hence 
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It is interesting to note that the first row is as for PS, the 
second as for PS+SYN, and subsequent rows like PS+SYN 
scaled by 1/pp. 

Theorem 13: The inverse of B is given by 
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Proof: It is easily verified that BB~^ = ■ 
It is not difficult to show that Proposition |2] reduces to 
J^^ = J^^ — C, where the only non-zero element of C is 

Cii = Pp^q^dKpldo + q^diy^ = do/pp since di = ^do- 
Furthermore, using ( fTTl i one can show that J^^ is tridiagonal 
(and symmetric) with upper off-diagonal terms (J^^)fc,fe+i = 
—Pp'^Qpdk+i, k < W, and diagonal elements 
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where we have used the property of B that dj = pp9j+qpdj+i 
for 1 < j < - 1. 

H. Dual Sampling (DS) 

DS can be defined simply as follows. First, at the packet 
level it consists of two PS schemes running in parallel, one 
which operates only on SYN packets with sampling probabil- 
ity pf, and the other only on non-SYN packets with sampling 
probability pp. In a second phase, sampled flows which lack 
a SYN are discarded, and sequence numbers are used to infer 
additional 'virtual' packets, as in PS+SYN+SEQ. Thus, at one 
level DS is simply a generalization of PS+SYN+SEQ, and 
reduces to it when pf = pp. However, the generalization is 
significant as it also includes FS as the special case pp = 1, and 
interpolates continuously between the two. This is illustrated 
in Figure |2] which depicts the {pp,Pf) parameter space, and 
marks the special cases. 

Dual Sampling is 'dual' in two senses. Computationally 
it can be viewed as the original PS sampling being split 
into two, at the (low) cost of per-packet switching based on 
some bit checking to determine which PS applies. Information 
theoretically, the sampling is now split into two parts with very 
different natures, each controlled by a dedicated parameter: 
a FS-like direct sampling of flows, and a PS-like in-flow 
sampling. Here pf controls the number of sampled flows, and 
Pp their 'quality'. 

The derivation of the sampling matrix mirrors closely that 
of PS+SYN+SEQ. The result shows clearly how pf and pp act 
in a modular fashion. The flow sampling component controls 
the top row of B and factors B, whereas the packet sampling 
component determines the internal structure of B. 
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The separation of the FS and PS roles in B is clearly reflected 
in its inverse: 
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The similarity between B^^ for SH and DS is striking. 
Indeed, whereas SH uses sampling to select which flows it will 
focus on and then holds to them, DS reverses these operations 
and thus could be described as a 'Hold and Sample' scheme. 
However, although they each combine PS and FS features, 
the methods remain significantly different. In particular SH is 
strongly flow length biased whereas DS is unbiased. 

Once again Corollary |3] for the inverse Fisher information 



holds, showing that J 



^6/0 \ Similarly to SH, 




Fig. 2. Pai'ameter space {pp,pf) of DS. Fixing ESR= p constrains the 
family to tlie solid (blue) curve pf{pp\p), where it reduces to PS+SYN+SEQ 
at Pp = p* and FS at pp = 1. Fixing PPR= p constrains the family to 
a straight (green) line emanating from {0,pD). For fixed p, the constraint 
curves depend on D. Three examples are given for each normalization. 



from (fTTT i one can show that J ^ is tridiagonal with upper off 
diagonal terms {3-^)kM+i = -{PfPp)~'^qpdk, k < W. The 
diagonal elements (here 2 < j <W — 1) are given by 



(J-')ii 



PfPp 



w 



(J- 



133 



- + — E'^r^^---^? 
Pf PfPp 

-h(d, + qld,+,)-^e^ 



(25) 



PfPp 
O3 



1 



w 



PfPp PfPp ktjtl 



Pf 



(J-1) 



iw 



WW 



p]pI Pf 



w 



ow 
PfPp 



Qfn2 

Pf 



(26) 



By setting pf — pp we obtain those for PS+SYN+SEQ. 

IV. Comparisons 

In this section we compare and contrast the performance 
of the different methods, using two normalizations which 
are the key to a fair comparison. We show that a positive 
semidefinite comparison holds for certain cases, and justify 
why we ultimately resort to comparison of the diagonals of 
the covariance matrix. We provide a partial ranking of the 
methods. Proofs of most key results in this section are deferred 
to Appendix ID] 

A. Normalization 

We must first consider how to compare fairly. We do this 
in two ways, using the following packet-based measures of 
incoming workload/information. 
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• Packet Processing Rate (PPR): the rate at which packets 
are initially being sampled (and hence require some 
further processing). 

• Effective Sampling Rate (ESR): the rate at which packets 
are arriving to the flow table (and hence become available 
as information for estimation). 

PPR is a measure of the processing speed required by the 
methods, whereas ESR is a measure of the arrival rate of 
packets containing information which is actually used by the 
method. Because the methods which discard flows without 
SYN packets lose (the majority) of their packets, an equal 
PPR comparison greatly disadvantages them. An equal ESR 
comparison effectively inflates the parameters of such methods 
to compensate. 

Note that given a sampling method X, both the PPR and 
ESR normalizations of X+SEQ and X are identical, as the 
methods only differ in how the packet data is used, not how 
many packets are physically collected. We now present the 
normalizations for each method. Here and below we use D = 
SfeLi ^^'fe > 1 to denote the average flow size. 

Table Ugives the results for PPR. Both the average sampling 
rate p as a function of the method parameter(s), and its inverse, 
the parameter value required to set the average sampling rate 
at p, are shown. The expression p{pp, 9) for SH was derived 
by computing the average number of packets sampled (which 
is highly dependent on 6) and dividing the result by the 
average flow size D. It is monotonically increasing in pp with 
0) = 1, and p/pp ^ J EZ, ^k/D > 1 as ^ 0, 

which in the worst case {9w = 1) becomes p/pp = (l+W) /2. 
We invert p{pp, 9) numerically as required for choices of 9, 
denoted by root(p(pp)) in the tables. DS has two parameters. 
We choose to make pp the independent one. 

Table HIl gives the results for ESR. Only the methods which 
involve discarding packets after their initial collection have 
changed compared to Table U Note that for PS+SYN p/pp — ^ 
l/D > 1 as Pp 0, and p{l) = 1. 

We are particularly interested in the ESR case for the DS 
family, and so repeat its ESR normalization here: 

pD 



Pp{D-l) + l- 



(27) 



The denominator pp{D — 1) + 1 is simply the average sampled 
flow size, conditional on its SYN being sampled. For fixed p, 
this equation gives pf as a monotonically decreasing, in fact 
convex, function of pp. Three examples (the blue curves) are 
given in Figure |2] for different values of D. To maintain an 
ESR fixed at p, if we increase pp we must move down the 
curve and decrease pf to compensate. The PPR case is similar 
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TABLE II 

Average ESR sampling rates and normalizations. 



but simpler as the level curves are straight lines (green lines 
in Figure Note that depending on p and D, under both 
PPR and ESR there are values of pf and/or pp that may be 
disallowed. 

To compare methods we also require a performance metric. 
A natural criterion is the set of diagonal elements of the CRLB, 
the k-th being 

(X+)fefe = (J-i),,-02, (28) 

since this is a lower bound on the variance of any unbiased 
estimator of 9k. We plot the square root of these values, 
calculated using the expressions for the previous section. 
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Fig. 3. An equal PPR comparison of the CRLB bound for with p = 0.005. 
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Fig. 4. Dependence of the CLRB bound on p, PPR comparison. 
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We begin with some instructive examples. Consider the 
resuhs of Figure [3] where we set W ^ 5 with 

e = {0.22,0.21,0.20,0.19,0.08}, 

for which D — 2.90 packets. We use the PPR normalization 
with p ~ 0.005, corresponding to pp « 0.0021845 for 
SH. For DS we give two examples satisfying p: {pf,pp) = 
(0.001,0.0443), and (p/,Pp) = (0.1,0.00334). 

As expected from earlier work, the performance of PS is 
extraordinarily poor In agreement with the results of |]4| and 
as expected, the inclusion of SEQ improves it enormously, 
by orders of magnitude, but it is still orders of magnitude 
behind FS, which has the lowest standard deviation bound 
of all. In a highly counterintuitive result, PS+SYN performs 
(much!) better than PS, despite the enormous number of 
wasted packets. We can offer two reasons for this. First, in 
the unconditional framework information in discarded flows 
is not entirely wasted, some is recouped by the (observable) 
increase in the j = outcomeQ. Second, SYN sampled flows 
(much like FS) have no flow length bias. Together these lead 
PS+SYN to perform better in most cases even under PPR. 
Three results for DS are given. From best to worst these were 
with pf = 0.001 < p, the special case PS+SYN+SEQ with 
Pf = p, and Pf — 0.1 > p. Finally SH performs well in this 
example as its key disadvantage, a strong bias to large flows, 
does not play a large role for such a small value of W. 

Using the same PPR based comparison. Figure |4] shows the 
improvement in CRLB as p increases, as we expect. We see 
that very high values of p are needed before the performance 
of FS is approached. Here the free parameter pf for DS 
was chosen to guarantee meaningful examples to either side 
of PS+SYN+SEQ. Specifically, the normalization defines for 
each value of p a feasible range for pj which includes 

Pf = p. For DSl we set pf = p — c{p — pj) < p, and for DS2 
Pf = p + c{pj — p) > p, where c G [0, 1] . In the figure we use 
c = 0.1. 

The great merit of the ESR normalization is that it allows 
us to view the methods as being different ways in which the 

'It was pointed out in (4], under a conditional framework, that 
PS+SYN+SEQ outperforms PS+SEQ on actual traces. Our own numerical 
evaluation of the CRLB of these methods (under the conditional framework) 
also show this in some cases, leading us to conjecture the same counterintuitive 
results may hold for the conditional framework. 
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Fig. 5. An equal ESR comparison of the CRLB bound for with p = 0.005. 



same budget of sampled packets can be allocated among flows. 
The essential tradeoff is that we can have more sampled flows 
of poor quality, or fewer of higher quality. In a class with no 
flow length bias, flow sampling is at one extreme: for a given 
ESR= p it gives the minimum number of sampled flows, each 
of perfect quality. Among sub-classes of methods with roughly 
the same number of flows, we can then distinguish between 
the finer details of how the 'holes' appear over the flow, which 
will have an impact on the degree of improvement which the 
sequence numbers can bring. 

Results using the same scenario as Figure [3] but ESR 
normalized are shown in Figure|5] As expected, all SYN-based 
methods improve significantly. In particular the DS with the 
smaller pf = 0.001 now outperforms SH, and is second only 
to FS. 

Another example with a larger = 50 and a truncated 
exponential distribution for with D = 16.039 is given 
in Figure |6l The same general conclusions hold, with the 
exception of SH, whose performance is now worse than 
PS+SYN+SEQ rather than considerably better This is to be 
expected at larger W, as SH expends its packet budget on the 
rare largest flows. At realistic W values in network data this 
effect is further exaggerated. 

It is interesting to note in the right plot of Figure |6] that 
variance decreases as k increases. This is consistent with the 
fact that the ambiguity inherent in an observation of j sampled 
packets is lower for larger j and disappears at j — W, since 
this observation can only arise in one way. The dip at very 
small k we attribute to the influence of the constraint. 

B. Positive Semidefinite Comparisons 

The examples above compared methods using the CRLB of 
each Ok separately. Let two methods have Fisher information 
Ji and J2. A more complete, and in a sense ideal comparison, 
would be to show that Ji > J2, since that would imply 
Xl < Xj, a lower CRLB. What this positive semidefinite 
comparison really means is that for any linear combination 
f{6) ~ aJO = o-k&k of the parameters, the (bound on 
the) variance of f{6) under method 1 will be less than that 
under method 2. A geometric interpretation is that the ellipsoid 
corresponding to unit variance of vectors under Ji lies entirely 
within that of J2. In this section we provide a number of 
comparisons of this type between methods, and explain why 
it is not suitable as a universal basis of comparison. 

We first confirm the intuition that methods lose information 
monotonically in their sampling parameter(s). 

Theorem 14: For any method Z surveyed here if pi > p2 
(for DS ppj > pp,2 and p/^ > p/,2) tiien 3z{pi) > Jz(p2) 
(for DS Jz(Pp,i,P/,i) > Jz(Pp,2,P/,2))- Equality holds iff 
Pi =P2 (forDS (pp,i,p/,i) = (Pp,2,P/,2))- 

The proof (see AppendixlDb. is a straightforward consequence 
of closure under sampling for all methods except SH. 

Next, we confirm that the use of sequence numbers increases 
Fisher information. 

Theorem 15: Jz+seq > Jz for each of Z ^PS and 
Z =PS+SYN. 
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Fig. 6. Comparison of the CRLB bound with W = 50. Left: PPR with p = 0.005. Pf = 0.001, Pp = 0.0052 for DS (pf < Pp) and pf = 0.01, Pp = 0.0047 
for DS (pf > Pp). Middle: ESR with p = 0.005. pf = 0.032, pp = 0.1 for DS {pf < Pp) and pf = 0.079, Pp = 0.001 for DS (pf > pp). Right: Zoom 
of middle plot, the same legend applies. 



Proof: We make use of the data processing inequality 
(DPI) for Fisher information (see Theorem|3T]in AppendixlBb. 
which states that if — y — X is a Markov chain, where X 
is a deterministic function of Y, then Jy(0) > 3x{S) which 
holds with equality if X is a sufficient statistic of Y . We first 
consider PS+SEQ and PS. Flows are selected randomly and are 
represented as a random vector of SEQ numbers V = {^4, A+ 
1, . . . , A+K~l} with realization v = {a, a+1, . . . , a+fc— 1}. 
Here, K represents the flow size (realization k) while the A, 
the initial SEQ number (realization a) is uniform over [0, Na~ 
1]. All operations on them are modulo Na, thus allowing wrap- 
arounds. A is independent of K. We also assume that W < Na 
to avoid problems with multiple wrap-arounds. 

After the PS process, we have the SEQ vector Y ~ {Yi, 1 < 
i < J}, (realization y = < i < j})- We define 

two statistics on Y: J(Y), the actual number of raw packets 
sampled, and L{Y) — ma.Xi{Yi) — mmi{Yi) + 1 (with a 
sample l{y) and L(Y) = if Y is empty), the inferred 
length of the sequence. The statistic L{Y) disregards A since it 
takes the difference of SEQ numbers. Note that both statistics 
are deterministic functions of Y and so form Markov chains 
e ^ Y ^ L{Y) and ^ Y ^ J(Y). A straightforward 
application of DPI yields Jy > Jl(y) ™d Jy > J,/(Y)- 

We show that L{Y) is a sufficient statistic of Y w.rt. 6. 
We use the Fisher-Neyman factorization theorem lfT4l Theorem 
6.5, p. 35] which states that if the probability mass function 
of Y takes the form 

Pr(Y = y) =h(y)g(/(y),0), (29) 

where h is independent of the parameters 6, then L{Y) is a 
sufficient statistic. Let > denote the number of unsampled 
packets before the first sampled packet. If J = j > 0, 

Pr(Y = y) 

W k-l 

k — l 771 — 

W k-l 

= Pr{A^a)Y,0,Y.q'-^pj, 

k=l m=0 



since A is uniform. For the case j = 0, 

w 

PviY = 9) = J2<lpOk. 

k=l 

Clearly, each case satisfies ( |29] l. hence L{Y) is a sufficient 
statistic of 9. 

By the DPI, since L{Y) is a sufficient statistic, Jy = 
Jl(y)- From the previous relation Jy > J./(Y)- we now have 
Jl(y) > J./(Y)- But J(Y) is equivalent to the statistic used 
in PS, proving the result. 

As for PSh-SYN sampling, we define an additional random 
variable S taking value 1 if the SYN packet was sampled, 
otherwise. V is defined as before. Y = if and only if S* = 0. 
The same statistics L{Y) and J(Y) are defined. Then, for 
J = j > 0, 

w 

Pr(Y = y) = ^ Ok Pt{A = yMpf'q'-nPpQp-' 

k=l 

W 

= PT{A = a)Y,ekql-'pl 

k=l 

and for j = 0, 

w 

Pr(Y = 0) = ^gp0fc = gp. 

k=l 

Once again, each case satisfies ( |29] l. hence L{Y) is a sufficient 
statistic. Thus, by DPI, the same relationship Jl(y) ^ Jj(y) 
holds, with J(Y) now equivalent to the statistic used in 
PSh-SYN. ■ 
Intuitively this result seems obvious: if the additional in- 
formation afforded by the sequence numbers is available, we 
should certainly be able to do better by using it. However, it 
is tempting to conclude that by the same logic Jps > Jps+SYN 
under the PPR normalization, since deciding to discard flows 
without a SYN is also a deterministic transformation. How- 
ever, the data processing inequality does not apply here 
because some of the information used by PSh-SYN (namely 
the SYN variable S), although available to PS, is not used by 
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it. Indeed we saw in the figures above that, counterintuitively, 
PS+SYN can actually outperform PS in terms of the individual 
variances under PPR, which is a counter-example to the more 
general positive semidefinite comparison. Under ESR this is 
even more true as we saw from the middle plot in Figure |6] 
We now give another, even more surprising counter-example 
which teaches an important lesson. 

Theorem 16: Jfs ^ Jps 

Proof: Proof is by contradiction via counter-example. Let 
W = 2 with 9i = 02 = 1/2, and let pp = pf = p. Evaluate 
1J(Jfs — Jps)l2, (the sum of each element of the matrix 
difference), which we expect to be nonnegative by assumption. 
It can be shown that this reduces to 

^ 2g(l + g^) ^ l + 3g-4g3 ^ 
1 + q l + 2q * 

which can be negative, e.g. when q = 1/2, a contradiction. ■ 
Using Lemma l27l" ii) one can show that Jfs ^ Jps implies 
Jps ^ which in turn implies ^ ^ps (see Lemma 
[26] l. Since we earlier showed that PS is enormously worse 
than FS for variances, this result is surprising, particularly as 
experience shows that it is not difficult to find other examples 
for much larger W. We can explain this quandary as follows. 
When we focus on the variances in isolation we highlight 
the poor performance of PS. However, linear combinations 
such as J2k ^kSk bring in cross terms, which for PS are 
negative because of strong ambiguity in its observations. 
Strong correlations are a bad feature of a covariance matrix of 
an estimator, however if they are negative, they can in some 
cases cancel other positive terms resulting in a lower total 
variance. Hence, paradoxically, it is the poor behaviour of PS 
which prevents Jfs > Jps from holding. 

The last example reveals that a ranking of methods based 
on a positive semidefinite comparison, although very desirable 
when true, is not possible in general. We therefore turn to a 
more generally applicable approach in the next section which 
we use for the remainder of the paper 

C. Variance Comparisons: a Partial Ranking 

In this section we focus on comparing methods via the diag- 
onal elements of X^, corresponding to the optimal variances 
of 6i estimates, just as in the figures above. As before, it 
is sufficient to consider the unconstrained covariance matrix 
J^^ since = J^^ — 90^ by ( fTsl l. To the extent possible, 
we seek to establish a hierarchy between methods based on 
diagonal comparisons. Thanks to Theorem [15] which applies 
in particular to diagonal elements, we already have the answer 
in some cases. 

We begin by comparing PS and PSh-SYN. The examples 
given so far showed PSh-SYN as enormously superior to PS. 
However, this is not always the case. A counterexample under 
the PPR normalization with p = 0.05 is given by 

e = {01, 01, 03<fe<io = 0i/lOO}, 01 =0.4808, (30) 

for which D w 1.69. As seen in Figure |2l PS outperforms 
PSh-SYN for Oq and 0io, but not otherwise. This makes sense 
given the bias of PS toward sampling large flows. 



For these same parameters PS+SYN outperforms PS for all 
Ok under ESR. This is typically the case. For example, with the 
above parameter set, PS+SYN outperforms PS on all sampling 
rates. While it may be true that PS outperforms PS+SYN under 
ESR in some rare situations, we believe this is not the case, 
and conjecture that PS+SYN defeats PS under ESR for all 0. 
For small p, we can prove that this is the case. 

Theorem 17: Under PPR and ESR normalization, for small 
enough p, (Jps+sYN)ij < (Jps^)ij for any 6», j = 1, . . . , W^. 
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Fig. 7. PPR comparison between PS and PS+SYN with pp = 0.05 for a 
particular distribution highly skewed to small flows. 

We now consider the above comparison after both methods 
have benefitted from the use of sequence numbers. 

Theorem 18: Under PPR and ESR normalization, for every 

3 < j < VK, (Jps+SYN+SEQ )ii — (JpS+SEQ)ii- 

For each of j = 1 and 2 counterexamples can be found 
for certain 6 and p, under both PPR and ESR. For example 
PS+SEQ has lower variance for 9i for the parameter set (l30l l 
with p = 0.05 under both PPR and ESR, and for 02 the family 
6 = {a, 1 — 2a, a} with a < 0.1 gives examples for wide 
ranges of p, including as p — !> for a small enough. The 
counterexamples occur in atypical situations where 0i or 02 
are large relative to other dj, and can be excluded if these are 
appropriately controlled. For example if p < 1/2 it can be 
shown that when 02/03 < Qp, then PS+SEQ is worse under 
PPR (and hence ESR). Given that PS+SYN+SEQ defeats 
PS+SEQ except for perhaps 0i or 02 under atypical conditions, 
in general PS+SYN+SEQ can be regarded as superior. 

The picture emerging from the above comparisons is that 
PS+SYN+SEQ is clearly superior to PS and PS+SEQ. Since 
PS+SYN+SEQ is just a special case of DS, we now consider 
this family in more detail. We focus on the ESR normalization 
which, apart from being far more important than PPR, is the 
key to our optimality result in Section [Vl] The following is 
one of our main results, a detailed characterization of the 
performance of DS under ESR. It can be shown that an 
analogous result does not hold for PPR. 

Theorem 19: The diagonal elements 2 < j < of 3^^^ 
under the ESR normaUzation are monotonically decreasing in 
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Pp. The property holds for j = 1 iff the condition 

O2 > ^^0,(1 - di) (31) 
is satisfied. Also, monotonicity holds when D = 1 or W = 2. 

The above result shows, provided ( |3TI ) is satisfied, that the 
variance bounds for each 9k under DS drop as pp increases. 
It follows that the optimal point Pp in the DS family lies at 
Pp = 1, that is flow sampling! The superiority of FS clearly 
demonstrates that the problem of inverting from imperfect 
sampled flows is so difficult that in the tradeoff between flow 
quality and quantity, quality wins convincingly. 

The exception is for 61 when dsTI ) fails, in which case the 
optimal DS lies to the left of pp = 1. However, this only 
occurs when 62 / Oi is extremely small, and even then in most 
cases Pp « 1 unless D is also very large. Consider the region 
Pp ^ 1. By solving for the optimal p* using (HTt . we obtain 
(provided D > 1). 



For p* to be much smaller than 1 , we require D to be large and 
the ratio 02 /9i very small (effectively, 02 being negligible). 
Such a scenario is very unlikely to occur in networks and in 
any case only affects 9i, and so it is reasonable to assume that 
the optimal DS corresponds to FS in practice. 

Our next result compares the final method, SH, against FS. 

Theorem 20: Under PPR and ESR normalization, for every 
2 < j < W, — ("^m)]]- P*^'" J — 1 the condition 

holds for any 9 for p sufficiently small. Sufficient conditions 
independent of p are either = 2, or for W > 2, 

02>Oi{l-9i). (33) 

The proof for (|33] | involves bounds which are tight as p — > 1. 
Hence, when the condition is violated and p is large enough, 
SH can indeed outperform FS for 9i. An example is given 
in Figure |8] with W = 6 and Pp = 0.4351, which is large 
and reasonably close to p = 0.5. However, this effect is 
difficult to see in distributions with larger W as SH shifts 
most of its packet budget to larger flows, resulting in pp being 
orders of magnitude smaller than p and the bounds leading to 
the sufficient condition becoming very loose. In all cases of 
interest therefore, FS can be regarded as superior to SH at all 
flow sizes. 

Finally, we clarify the relationship between DS and SH more 
generally. DS is a two parameter family whose performance 
varies in a wide range. Indeed, Figure |5] already shows that it 
can perform better than SH at the 'FS' end of its range, and 
worse at the other end. The important observation is that in all 
cases where FS outperforms SH for a given 6, it follows from 
continuity of the CRLB in {pp,pf{p)) that there are members 
of the DS family other than FS which do likewise. Theoreml20l 
shows that this is true in almost all cases under ESR, allowing 
us to conclude that in general DS outperforms SH at all flow 
sizes. 



The natural counterexample to this general rule is when 
(I3TI 1 is satisfied so that the best DS can do is given by FS, 
and yet 9 is such that SH defeats FS (for 9i), implying that 
SH defeats all members of the DS family for 9i. This can 
occur if (l33T l is not satisfied and p is large. Furthermore there 
are cases where both SH and the optimal DS defeat FS, in 
which case the comparison is more complex. Since however 
this battle is contained within a very small and unimportant 
region of {6,p,pf{p)) space, we do not characterise these 
counterexamples fully but instead conclude with the following 
result, which for the first time exhibits specific DS members 
(other than those in a neighbourhood of FS) which defeat SH. 

Theorem 21: Under ESR normalization with rate p, if < 
Pp.SH < ^D-i ' ^^^^ ^ sufficient condition for (Jps)jj < 
(JsH )jj f"'" every 1 < j < is that Pp, ds satisfies 

Pp.SH < Pp,DS- 

Summary: We have confirmed that a ranking of methods 
based on a positive semidefinite comparison is impossible, 
since even a diagonal comparison does not yield a simple 
hierarchy. Nonetheless, ignoring the counterexamples which 
sometimes occur for the CRLB variance for 9i and perhaps 92, 
a clear overall picture emerges from the comparisons above. 
First, from Theorem [15] the utilization of sequence numbers 
yields consistent information theoretic dividends. In particular, 
PS need not be considered further since it has extremely 
poor performance and by Theorem [15] PS+SEQ is better. 
Theorems [17] and [18] show that the use of SYN sampling as 
a technique to eliminate bias is very powerful, which makes 
PS+SYN+SEQ the leading candidate, and focussed attention 
on its generalization, DS. Theorem [19] then shows that DS 
out-performs PS+SYN+SEQ under the ESR normalization 
(the most relevant one for estimation performance) provided 
Pp > Pp, and furthermore that FS is the favored member of 
DS. Theorem [20] shows that FS is also superior to SH. In 
conclusion, FS is the best sampling method for all flow sizes. 




1 2 3 4 5 



Flow Size k 

Fig. 8. PPR and ESR comparison between FS and SH with p = 0.5 for a 
particular distribution highly skewed to small flows. 
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This strongly motivates the adoption of sampUng methods 
which approach FS as closely as possible, but which are 
efficiently implementable. This refocuses attention back to DS 
as it enjoys both of these properties. 

V. Analysis of the SEQ Dividend 

The utilization of sequence numbers provides additional 
'virtual' packets for free. The extent to which this improves 
our information on clearly depends on parameters. For 
example, flows with only fc = 1 or 2 packets are not helped at 
all. Intuitively (for example, under PS) it is the flows for which 
A; ^ 1 /pp which will receive the most benefit: since only the 
packets before the first and after the last physically sampled 
packets will not be recovered using sequence numbers, and on 
average there are 2/pp of these since \/pp is the average gap 
between two physically sampled packets within a flow. 

To quantify the benefit of sequence numbers better we 
define the effective packet gain, which is the ratio r = 
E[7Vfe]/E[iVfc], where Nk and Nk are the number of SEQ 
assisted and unassisted (physical) packets sampled respectively 
from a flow of size k. By definition, R = Nk/Nk > 1 and so 
r > 1, and asymptotically the gain saturates at r = 1/pp. 

For DS it can be shown (see ifTSi for more details) that to 
obtain a ratio a of this maximum gain, a flow must have a 
size of the order of |^|y3^, suggesting that flows must have 
a size of roughly l/pp or more before the sequence number 
'information dividend' effectively switches on. Although intu- 
itively appealing, this is however quite misleading. There is a 
high variance associated with the random variable Nk, which 
for large flows under PS+SYN+SEQ goes as 

Var(iVfc) K — + k{k- 2)pp + {2k - 3). 
Pp 

Therefore the gain R is not sharply concentrated around its 
mean and is highly dependent on the sampling parameter 

More generally, the sequence number dividend 'switch' is 
primarily about whether at least two packets are sampled. 
When p is small this is a very rare event, but can nonetheless 
be responsible for most of the available information about flow 
size. For example in Figure [3] the probability of sampling two 
packets is of the order of 10~^, and yet the ratio of variance of 
PS to PS+SEQ is around 400 - even rare dividends are worth 
having! Thus even for smaller to medium sized flows ('mice' 
and 'rabbits'), the use of sequence numbers provides a huge 
reduction in estimator variance, as shown by numerical and 
empirical evidence throughout this paper. 

It is interesting to compare the SEQ dividend with SH, 
which implements a kind of extreme SEQ effect in the sense 
that if only a single packet is sampled, then all subsequent 
ones will be. This is a more powerful 'switch' than the SEQ 
mechanism which requires at least two packets. For large flows 
however, packets inferred under DS approach this level of 
packet recovery since, because the SYN packet is sampled, 
only a geometric number of packets will be missed at the end 
of the flow, compared to a geometric number at the beginning 
under SH. Of course, SH achieves this solely from its real 
packet budget whereas DS does not, which explains why DS 



outperforms SH even in the latter's area of strength, namely 
very large flows. 

VI. Computational Issues 
A. Optimizing DS with Computational Constraints 

From our result on the monotonic behaviour of ESR nor- 
malized DS, what is optimal in terms of information is clear: 
simply move down the ESR constraint curve to FS. However, 
there are other constraints from the resource side which 
will constrain the parts of the curve that will be accessible. 
The solution to the joint information/resource optimization 
problem is therefore clear: move as far down the ESR curve as 
possible. Our task now is to determine those constraints and 
hence the region in the {pp,pf) plane which is feasible. 

The bottlenecks in the implementation of any sampling 
method are the memory access time and memory size. CPU 
processing is included in the memory access, since the CPU 
has to read out values from memory during lookup, performing 
modifications and write-backs when necessary. These tasks 
constitute the main portion of what is required of the CPU 
for measurement, as the measurement process is basically all 
about efficient counting f5\. This can be done for example 
with a hybrid SRAM-DRAM counter achitecture ||T6l, iflTl . 
ifTSl . where a small amount of (fast but expensive) SRAM is 
used for the counter and its value periodically exported to a 
(cheaper but slower) DRAM store. With about 10 ns access 
time for SRAM, such a counter can be implemented even for 
OC-768 Hnks which run at 40 Gbps. 

We now consider how to optimize DS based on these 
constraints. Regarding flows, let T^ax be the maximum flow 
table size measured in terms of flow records and Af be the 
flow arrival rate. As for packets, let C be the capacity of the 
link, P the size of the smallest packet (« 40 bytes), and t the 
access time of memory (nominally DRAM). 

Our simple analysis of the above bottleneck constraints is 
based on the following. In terms of packet arrivals we assume 
the worst case, namely the smallest size packets arriving back- 
to-back at line rate. By bounding the processing in such a case 
we guarantee that the front line of packet processing (which 
occurs at the highest speeds) is not under-dimensioned. In 
terms of flow arrivals we assume the 'average case' based on 
the average number of active flows. This can easily be made 
more conservative by replacing the average by some quantile 
to take into account fluctuations in flow arrivals. For simplicity, 
we ignore data export constraints from the measurement center 
to a central data collection center. We also do not consider 
rate adaptation based on the traffic condition although such 
schemes are compatible with DS. 

Consider the processing of a single packet. The SYN bit is 
first tested to see which sampling parameter will apply, the cost 
of this is negligible, and if a packet is not sampled no further 
action is needed. Now consider the cost of a packet which is 
sampled. Each SYN packet which is sampled is inserted into 
the flow table. No prior lookup is necessary since it must be 
the first packet of its flow. Each non-SYN packet which is 
sampled must first perform a lookup of its flow-ID in the flow 
table to see if that flow is being tracked. If not it is discarded. 
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The cost of this wasteful per packet implementation of the 
'flow discarding' step (inherent to any SYN based method 
such as DS) is not the bottleneck because the following case 
is the most expensive of all and is the one we model: a non- 
SYN packet which is sampled and whose flow is being tracked 
requires both a lookup followed by an update. 

Using the above, the constraints are 



The diagonals become 



Pf Pf ~ niin ( 
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), Pp <pp 



p 

2^' 



(34) 



The constraint T.,nax/ {DXp) ensures that the average number 
DXp of active flows does not exceed the flow table size. 
Constraint P/{tC) provides the worst case bound for per 
packet processing, for a single operation (insert or update). The 
factor of 2 that appears in the denominator for fSp is to account 
for the worst case, in which a lookup and update is needed. 
The analysis is based on the use of a single per flow counter 
to track flow size. In practice, there may be more counters 
needed to track other quantities of interest, necessitating tighter 
constraints. 

In the sequel, our examples consider traffic mixes that have 
manageable numbers of SYN packets, so that from ( l34l i. 
Pf ~ Tmax/ [DXp]- This is done mainly to illustrate the 
relationship between the CRLB and parameter pf. Indeed, at 
lower link speeds (OC-48 for example), pf is determined by 
the number of flows arriving at the measurement point, rather 
than packet processing time. Secondly, to increase accuracy, 
from the previous discussions, we would like pf < pp, 
i.e. fewer, but better quality sampled flows. 

The constraints form a simple region on the {pp,pf) plane 
that is convex, since it is rectangular with a corner at the origin. 
We want to minimize the variance for each 6k subject to these 
constraints. Since the ESR curve is convex with respect to 
Pf and Pp, the optimal value must lie on the vertex of the 
convex constraint set llT9l Corollary 32.3.1, p. 344]. For this 
to hold, we require that the optimum for 9^ on the ESR curve 
(Section IIV-CI ) is outside the constrained region. This will be 
the case for all k for any reasonable traffic mix. Under such 
conditions, the solution is therefore pf — pf and pp = pp. 

We then have a relationship between the flow table size and 
link capacity and the diagonal elements of J^j. Since it is 
apparent from ( |26] | that the diagonals are dominated by 1/pf, 
by substituting the optimal solution, we have for 1 < j < W, 



(^DS )j3 



0(4) = 0(; ^ 



Pf 



Thus, with a larger table size (i.e. more memory available), 
variance of the estimator can be reduced. 

As for the relation to capacity, we assume that C is large 
and use the approximation (7^ w 1. This can be justified con- 
sidering that sampling is required beyond OC-3 Unk speeds. 
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which are approximately inversely proportional to pp, and 
hence are 0{C). We conclude that the variance of DS is 
inversely proportional to memory usage and proportional to 
the link capacity. 

As an example, consider an OC-192 link with DXp = ^ x 
10^ flows/sec, T„iax = 100, 000 and an access time of 100 
ns for DRAM. Let us assume a further 100 ns is required for 
further processes (e.g. sequence number information). Thus, 
Pf = 0.1 and Pp = 0.08. With our numerical evaluation on 
the Leipzig-II trace (discussed in the following section), this 
would imply that the trace contain at least 9.5 x 10^ original 
flows in to achieve a standard deviation of 10~^ or better. If 
we compare this to PS with a sampling rate of pp = 0.1, 
we require a staggering 5.6 x lO"'^ flows to achieve the same 
performance ! 

To observe the dependence on memory and link capacity, 
now consider an OC-768 link instead with T^ax = 10,000. 
This time, we have pf = 0.01 and pp = 0.02. The number 
of flows required to achieve the same standard deviation now 
increases to 8.8 x 10^ which is still orders of magnitude better 
than PS. 

If we consider the SRAM-DRAM architecture discussed 
earlier, with state-of-the-art SRAM having access times of 
about 5-10 ns, this can only increase the value of pp. Ideally, 
we still keep pf low to reduce the number of flow entries while 
increasing pp to approach FS performance. Continuing on 
from the previous example of the OC-192 link and assuming 
T = 5 ns, can now approach 1. For the OC-768 example, 
Pf ~ 0.01 while Pp = 0.8, giving huge performance gains. 

B. Other Issues 

Checking for the SYN bit can be done in a simple way by 
testing the payload type in the IP header and then verifying the 
presence of the SYN bit. This takes much less effort than deep 
packet inspection systems. Furthermore, sampling decisions 
can be implemented using precalculated values, much faster 
than a straight random number generator implementation. 

We address the problem of flow table overload by using 
the method proposed by Estan et al. 1201 by defining discrete 
measurement time bins, where sampled flows are exported at 
the end of each time bin. Consequently, overload of the flow 
table can be avoided at the cost of increased export rate. 

VII. Case Study on Internet Data 

In this section we test the performance of DS on two traffic 
traces under the ESR normalization, and compare it to FS 
and its closest competitor, SH. We also further examine the 
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benefit of sequence numbers both in a fully empirical setting, 
and an idealised one. Since we work with real data, we 
require an unbiased estimator for each method. We propose 
closed-form estimators that achieve the CRLB asymptotically. 
We test their statistical performance by examining how close 
they approach the empirical CRLB on the traces we used. In 
general, the empirical results match the theoretical ones from 
earlier sections well. 

A. Data Traces 

We used two publicly available network traces, Leipzig- 
n 1211 and Abilene-III ll22ll . which are each collections of 
anonymized packet headers passing through a single router. 
Since the raw traffic is at packet level, specialized software 
such as CoralReef 1231 are required to reconstruct flows. We 
modified the software so that only TCP flows were analyzed. 
Summary statistics of these traces appear in Table Both 
traces are unidirectional, presenting problems when construct- 
ing a sequence number function, as elaborated later. 

There are many flows whose SYN packet is missing as they 
began before the measurement interval. To be consistent with 
our model assumptions, we remove these. A similar situation 
was encountered in HI. Fortunately, such flows are in the 
minority, for example with Leipzig-II they account for only 
18%. Furthermore, when sampling the trace, we assume an 
infinite timeout, that is, flows are expired at the end of the 
measurement interval. This is in accordance with the fact that 
we do not consider flow splitting. In practice, timeouts would 
split a flow, resulting underestimation of flow size. 

Note that some flows may be malformed and have one or 
more SYN packets within the flow. Potentially, DS may sample 
one of these packets and treat it as a new flow, instead of part 
of a longer one. However, such cases are rare. In Leipzig-II, 
there are only 468 of such flows, or « 0.02%, and there are 
none in Abilene-III. These flows were left in the trace. 
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Fig. 9. Comparison of DS on Leipzig-II, using W = 1000 and varying 
parameters under ESR normalization with p = 0.01. 
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2,277,052 
23,806,285 


02:46:01 
00:59:49 


19.76 
16.12 



TABLE III 
Summary of the traces used 



The value of D in Table |III] is the actual average flow size. 
In our experiments, we truncate Leipzig-II to W = 1000 and 
Abilene-III to W ^ 2000, resulting in D being 1.94 and 7.65 
packets for each trace respectively. Truncation is performed 
by discarding all flows with size above W, which ensures that 
the assumption 6'fc > 0, /s = 1, 2, . . ., is met. 

B. Closed-Form Unbiased Estimators 

The analysis of earlier sections were centered on the CRLB, 
which bounds what is achievable by any unbiased estimator, 
but it is neither a construction of such an estimator nor 
even a proof of its existence. When working with real data 
an actual estimator is required, ideally one that achieves 
the previously computed CRLB. The maximum likelihood 



Fig. 10. Comparison of DS on Leipzig-II, using W = 1000 and varying 
parameters under ESR normalization with p = 0.001. 



estimator (MLE) is an attractive candidate as it is asymptot- 
ically efficient, guaranteeing that its performance approaches 
the CRLB asymptotically l24l. However, the MLE, although 
unbiased asymptotically, is in general biased, especially in the 
small sample regime. 

We propose the following estimator for FS and DS (or other 
SYN based methods such as PSh-SYN): 



(35) 



This estimator has a natural interpretation as an empirical 
histogram based on observed sampled flow counts, inverted 
by B^^ to the original flow distribution. It is easy see that 
it is unbiased since E[0] = B^BO = 6. The matrix 
can be computed explicitly for these methods, as shown in 
Section Hn] 

Our estimator is further motivated by its relation to the 
MLE, as we now show (all proofs are deferred to Appendix|E|. 
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Fig. 1 1 . Comparison of DS on Abilene-III, using W = 2000 and varying 
parameters under ESR normalization with p = 0.005. 



Theorem 22: If the sampling matrix of a method satisfies 
bo = q^w^ then the MLE is 

g = B-M / K, M^, . ■ . , M{^]A . (36) 

By rewriting Y^=i ^^'j -^/(^ ~ ^'^U-^f) observing 
that Mq/Nj converges to q with high probabihty (this follows 
from Hoeffding's inequality f25^, p. 303]), the MLE ^ 
tends to our estimator (|35] l. which therefore approaches the 
CRLB asymptotically. Our proposed estimator only obeys 
the constraint 1^/6 = 1 on average, while the estimator 
in Theorem |22] always obeys the constraint. (Note that (|36] | 
remains a viable estimator in the conditional framework, see 
Remark [32] in Appendix |E]) 



The following appUes to SH. 

Theorem 23: The MLE for SH is given by 

^SH = ^ • b(M^ + AfO, M^, . . . , M{yf. (37) 

This time the MLE is unbiased (see Appendix |E]), so we 
use it directly on the data. It closely resembles the estimator 
(|35] |. with a slight difference: for the estimate of 9i, the 
number of missing flows A/q plays a significant role. This can 
be interpreted as follows. Since SH works by geometrically 
skipping the first few packets in a flow, flows of size 1 are 
those most likely to have been entirely missed. Hence, the 
simplest way to incorporate a knowledge of Mq is to assume 
that it arises solely from evaporated flows originally of size 1 . 

The advantage of simple closed form estimators such as 
these, which only require a matrix multiplication, is that they 
eliminate the need for iterative estimation algorithms such 
as Expectation-Maximization (EM), often employed in the 
literature H], ||4|, lH. From a computational viewpoint, this is 
highly advantageous. 

C. Testing with a Perfect Sequence Number Function 

We begin our case study by testing DS with a perfect 
sequence number function, which returns the exact number 
of packets between two sampled packets. As we have access 
to the original unsampled flows, this is easily evaluated. Apart 
from being a benchmark for sequence number functions that 
may be designed in the future, a perfect function allows clean 
comparisons between alternative methods employing sequence 
numbers to be made. 

In Figure |9] for an ESR of p = 0.01, values of pp from 
Pp = \ (equivalent to FS) steadily decreasing to pp = 0.001 
are shown (DS with pj = pp — 0.019 corresponds to 
PS+SYN+SEQ). As Pp — ^ 1 the performance vastly improves. 
Similar results were observed in Figure [lOl where p = 0.001. 




Fig. 12. Comparison of FS, SH (pr = 0.0002) and DS {pf = 0.0117, 
Pp = 0.7) on Leipzig-ll, using W = 1000 and varying parameters under 
ESR normalization with p = 0.01. 



Fig. 13. Comparison of FS, SH (pr = 0.00018) and DS ipf = 0.0134, 
Pp = 0.7) on Abilene-lll, using W = 1000 and varying parameters under 
ESR normalization with p = 0.01. 
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Fig. 14. Comparison of DS on Leipzig-II, using W = 1000 and varying Fig. 16. Comparison of DS on Abilene-lll, using W = 2000 and varying 
parameters under ESR normalization with p = 0.01. An imperfect sequence parameters under ESR normalization with p = 0.005. An imperfect sequence 
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Fig. 17. Benefits of using sequence numbers. Three cases are shown: 
Fig. 15. Comparison of DS on Leipzig-II, usmg W = 1000 and vaiying pERpECT uses a perfect sequence number function, MAPPING uses an 
parameters under ESR normalization with p = 0.001. An imperfect sequence i^^^^^^i function and SEQ OFF uses no sequence number information, 
number function is used here. 



In all cases, a chronic lack of samples at the tail end of the 
distribution causes inaccurate estimates, with zero samples 
showing as discontinuities. FS holds to the true distribution 
the longest, as we expect, and the best DS performs similarly 
to it. With a much larger sample set in Figure [TT] we can 
see better agreement between the estimates and the original 
distribution. 

We also tested SH, as seen in Figures [12] and [13] To simplify 
the comparison across the traces, we truncate each to W = 
1000, so that D becomes 1.94 and 6.45 packets for Leipzig-II 
and Abilene-III respectively, and use the same ESR value p = 
0.01. In both cases the performance is much worse for SH than 
DS, which tracks FS and the true distribution well. In Leipzig- 
II the performance is very poor almost everywhere, while in 
Abilene-III the front end of the distribution, up to about flows 
of size 8, is estimated quite well, before deteriorating badly 



at larger sizes. The good performance of SH at j = 1 can be 
attributed to the fact that this event is dominated by the sheer 
number of original flows of size one. Both FS and DS tracks 
the distribution much better since both methods have sampled 
flows of far superior quality to SH. 

D. Testing with an Imperfect Sequence Number Function 

We now test DS with an imperfect sequence number func- 
tion. Our function is similar to that outlined in Q. However, 
as we do not have statistics of popular TCP payload sizes 
available, we infer the most likely payload size as follows. 
If the sequence number difference is divisible by a popular 
payload size (for example, 1460 bytes), we take this as the 
most likely payload size. Otherwise, we use the average 
payload size. This function is subject to errors, especially 
when a flow has variable payloads, however it suffices for 
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our purposes. 

In addition to TCP sequence numbers, we exploit IPID num- 
bers. As mentioned in |4J, the IPID field of Linux machines 
is incremented sequentially for each TCP flow every time a 
packet in the flow is transmitted. Given that the majority of 
web-servers on the Internet are Linux-based, we exploit IPID 
numbers to check the accuracy of our estimate when a flow 
has packets with variable payloads. 

Furthermore, the unidirectional nature of the trace presents 
a significant challenge. As one side in a TCP connection 
usually transmits more data than the other, some sampled flows 
may consist mainly of TCP ACK packets, which means that 
they will have zero-byte TCP payloads. In this case sequence 
numbers may not be incremented at all, and so will not provide 
information about the number of bytes transmitted. A solution 
is to use the TCP acknowledgement numbers instead to infer 
the number of bytes transmitted from the opposite direction, 
which would yield an estimate of the number of packets in 
the TCP ACK flow. This may not be the most ideal solution, 
as this method would be susceptible to delayed ACKs, thus 
underestimating the size of the flow. 

Modern web browsers rely on maintaining TCP connections 
rather than initiating new connections, which require more 
memory. This is the persistent HTTP protocol. The prevalence 
of this protocol amongst web browsers presents a challenge, 
since empty payload packets are periodically sent to keep the 
connection alive. These packets do not increment the sequence 
number. The best we can do in such cases is to infer using IPID 
numbers, or possibly counting ACK packets coming in from 
the other direction, if bidirectional information is available. 

Even with the imperfect and relatively simple sequence 
number function used here, results are consistent with theory. 
Figures [l4l and [TS] illustrate this. In both cases, the imperfec- 
tion of the function affects the accuracy of DS, but not to 
a large degree. A similar observation applies to the Abilene- 
III trace, shown in Figure [16] where the artifacts due to the 
imperfect function (the sawtooth pattern) are clearly seen. 

Finally, Figure [TT] illustrates the effect of using sequence 
numbers in recovering the flow size distribution. The three 
cases shown are for DS with parameters pf = 0.00117 and 
Pp = 0.7, with an ESR of p = 0.01. The PERFECT case 
is when DS is given a perfect sequence number function, 
MAPPING when using our sequence number function, and 
SEQ OFF when no sequence numbers were used at all. 
It is apparent that using sequence numbers, even with an 
imperfect function, provides significantly more information to 
an estimator. 

E. Empirical estimator variance 

Here, we see how closely the estimator variance matches 
the CRLB by computing the observed Fisher information 
1261 of the estimator in Figure [18] To improve readability, 
smoothing was applied to the tail end of the observed Fisher 
information where samples are scarse using a simple moving 
average filter with a window size of 100. Even when using 
an imperfect sequence number function, the variance of the 
estimator closely matches the CRLB, effectively proving the 
benefit of sequence numbers. 
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Fig. 18. CRLB versus observed MLE variance of DS with pf = 
0.00625, Pp = 0.6 on Abilene-III. 

VIII. Conclusions and Future Work 

We have re-examined the question of sampling for flow 
size estimation in the context of TCP flows from a theoretical 
point of view. We used the Fisher information to examine the 
inherent potential of a number of sampling methods. Most 
of these had been examined previously, but we showed how 
the usual conditional framework can be made unconditional 
and thereby simplified, which actually changes the sampling 
methods themselves and their performance. The new frame- 
work led to a number of new rigorous results regarding the 
performance of sampling methods which we studied under 
two different normalizations. It also enabled flow sampling to 
be compared to methods using TCP sequence numbers and 
Sample and Hold for the first time, and we showed that it is 
far superior to them, except in very special cases which are 
not important for network measurement applications. 

We introduced a new two parameter family of methods. 
Dual Sampling, which allows the statistical benefits of flow 
sampling to be traded off against the computational advantages 
of packet sampling. We discussed how, as an unbiased 'Hold 
and Sample' method, it differs from Sample and Hold, and 
proved that it is superior to it. We argue that the scheme is 
implementable and offers an efficient way of approaching flow 
sampling in practice to the extent possible. We also proposed 
closed-form unbiased estimators for SYN-based methods and 
SH which asymptotically achieve the CRLB, saving compu- 
tational time in the estimation stage. 

We performed a case study of Dual Sampling and Sample 
and Hold on two Internet data traces using our proposed 
estimators, and found results entirely consistent with the 
theoretical predictions, despite the fact that the function which 
maps sequence numbers to packet counts introduces a new 
source of error, and was not highly optimized. Although there 
is high variation at the tail end of the distribution, our proposed 
estimator closely matches the CRLB. 

In future work, we intend to search over the space of all 
possible sampling matrices to find an optimal sampling method 
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for flow size estimation, and to compare it to flow sampling. 
Our framework is applicable to other traffic metrics such as 
anomaly detection and a future direction is to extend the work 
to those areas. It would also be of interest to improve the 
sequence number mapping function, and also to explore using 
our approach for the direct estimation of the byte size of flows, 
for which sequence numbers are more naturally suited, rather 
than the packet size. Finally, we will develop a more detailed 
case for DS and its implementability in high speed routers. 

Appendix A 
Our Matrix Lemmas 

A. General 

Lemma 24: The matrix J(0) and its inverse J(0)~^ are 
symmetric and positive definite. 

Proof: For simplicity we omit the dependencies. Recall 
that J = B^DB where D is a real diagonal positive definite 
matrix with iT))jj = and rank(D) = W. Matrix B is a 
W X W matrix and rank(B) = W. It follows that an inverse 
exists for both D and B. Hence, an inverse also exists for J 
since J-i = B-iD-1(BT)-i. 

An equivalent expression for J is 

J = bTdI/^dI/^B = (D1/2b)T(D1/2b) (38) 

since D^/^ is symmetric. 

We can now show that J is symmetric. We have 

JT = [(D1/2B)T(D1/2B)]T 

= (D1/2B)T(d1/2B)=J. 

The form (D^/^B)"'"(D^/^B) is at least positive semidefinite 
(Theorem [28Tl. However, since J^^ = B-iD-i(B'^)-\ J is 
invertible, it is also positive definite. By definition of symmet- 
ric, positive definite matrices, its inverse is also symmetric, 
positive definite. ■ 
Lemma 25: The unconstrained Fisher information matrix 
J{0) and its inverse J{6)^^ are symmetric and positive 
definite. 

Proof: Recall that J = B^DB where D is a real diagonal 
positive definite matrix with (D)jj = dj\ and rank(D) = 
W + l. Matrix B is a {W + 1) x W matrix and rank(B) = W. 
An equivalent expression for J is 

J = BTdi/2di/2B = (D1/2B)T(di/2B) 

since D^/^ is symmetric. Now 

JT = [(D1/2B)T(D1/2B)]T 

= (Di/2B)Tpi/2B) ^ 

From Theorem l28l (D'^^'^B)'^ (D^^^B) is positive semidefi- 
nite. Moreover, J is invertible by Proposition |2] implying it is 
positive definite. By definition of symmetric, positive definite 
matrices, its inverse is also symmetric and positive definite. ■ 

Lemma 26: Ji > J2 if and only if < . 

Proof: Since by Lemma |27l ii). Ji > J2 iff Jj~^ < 
^2^, then by definition of positive semidefinite matrices, 
x"'"Jj^^x < x"'^J2'^x. Hence, this implies x"'^(Jj^^ —99^):x. < 
xT(J-i - 00'r)x, implying Ji > J2. ■ 



B. Proof of Theorem |4] 

Let E = (l/do)bobJ from ( fTol i. It has rank 1 and is there- 
fore positive semidefinite since its eigenvalues are tr(E) = 
12^=1 ^ofc/'^o with multiplicity 1 and with multiplicity 
W -1. U follows from Lemma |29] that J(6») > J(6») since 
3{0) = E + 3 (6), and from Lemma |27] this implies that 
J-i(0) - J-i(0) > Owxw and therefore 3~^{0) < 3~^{9). 
Equality can only hold iff E = Ovfxw since this is the only 
case where all eigenvalues of E are zero. This implies that 
bp = Owxi is required for equality, that is that no flow can 
'evaporate'. 

Appendix B 
Other Matrix Lemmas 

We collect some useful results required in this paper here. 
This first result comes from |[T2l . 

Lemma 27: Let A be a n x ri symmetric positive definite 
matrix and B an n x 71 positive definite matrix. Then 

(i) If B — A is positive definite, then so is A^^ — B^^, 

(ii) If B — A is symmetric and positive semidefinite (imply- 
ing B is symmetric), then A~^ — B~^ > 0. 

The following theorem appears in ll27l Theorem 6.3, p. 161]. 
Lemma 28: The following statements are equivalent: 

(i) A is positive semidefinite; 

(ii) A = B*B for some matrix B, where B* is the 
conjugate transpose of B. 

The following result gives more properties of positive semidef- 
inite matrices 1271 Theorem 6.5, p. 166]. 

Lemma 29: Let A > and B > have same size. Then 

(i) A + B > B, 

(ii) A1/2BA1/2 > 0, 

(iii) tr(AB) < tr(A)tr(B), 

(iv) the eigenvalues of AB are all nonnegative. 

The matrix inversion lemma (also known as Woodbury's 
formula) can be found in Qll Theorem 18.2.8, p. 424]. 

Lemma 30 (Matrix Inversion Lemma): Let R be a 71 x ?i 
matrix, S a ri x ttt, matrix, T a m x m matrix, and U a jti x ?i 
matrix. Suppose that R and T are nonsingular. Then, 

(R + STU)-i ^ R-i - R-iS(T-i + UR-iS)-iUR^^ 

The data processing inequality for Fisher information from 
1281 is as follows. 

Theorem 57.- If 8 — 5- F — > X satisfies a relation of the 
form f{y,x\Q) — f0{y)f{x\y) (i.e. the conditional distribu- 
tion of X given Y is independent of 0), then J y (O) < Jy (8) 
with the deterministic version being J-y(y)(8) < Jy(8). 
Equality holds if 'y{Y) is a sufficient statistic relative to the 
family /e(y), i-C- 8 — > liY) ^ Y forms a Markov chain. 

Appendix C 
Sampling Methods 
A. Proof of equation dlTI l 

Expanding (1 + (—1))'^ leads to the useful identity 

E(-ir^Q) = (-ir^ (39) 
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Using dnii and b'^^. = (-1)''"^ . jqp^^Pp'' for PS, we have 



W 2 

^o + E£i^.(,Jpp'=Et=i(-i)'=-^©)' 

The denominator can be simplified as follows: 



k=l i=\ 

W k 



k-l 



rfo + E4-9fp;^^-(E(-i) 

k=l £=1 
W 



k-i 



fc\\2 



fc=l 



W 



E„2k„-2k, 
Qp Pp «fc 



fe=0 



where we used identity ( |39] l in the second line. 
Similarly, the numerator can be simplified: 



W k 



W 



k\ fk 



E 4 (^) i-if-^qf-^p-^'^Y.^-ir' 

k=j ■' £=1 



w 



w 



E4(]j(-i)^-^<7f-V'(-i)'"' 



Y^i-ir-'-'d^i^^p'^-^p-^^ 

where (|39] l is used in the third line. This proves (fTTI l. 



applying SH with some Pp. We turn instead to a direct method, 
exploiting the tridiagonal structure of J^jj. 

Denote ~ ciij, and consider the quadratic form 

x^J^Jx for any non-zero vector x G 



pW. 



^'^Jsh'^ = ("1,1 + 01,2)2^1 + {aw,w + aw-i.w)x\r 



(40) 



w-i 

+ E ("jj + "-j'-ij + "^jj+i^ 

J = l 

From Lemma|5]we know 3^^1w = 6- It follows that 

(ai4 + ai.2) = 6*1 
(ajj + fli-ij - Oij+i) = .7 = 2, . . . , W - 1 

(OVK.W + 0'W-l,w) = ^iv 

which are each independent of pp. Consider then the term 
involving -ajj+i = Pp'^qpdj+i = EkLj+i 1p~^ ^k, 1 < 
j < W. Differentiating with respect with pp, we obtain 

w , w 



p fc=j+i 



Appendix D 
Comparisons 

A. Proof of Theorem \T4\ 

Let Z denote any method except SH, and Y the complete 
outcome (i.e. the vector of SEQ numbers and the SYN variable 
in the richest case) of sampling with Z using parameter pi 
(for DS (pp,i,p/,i)). Now sample Y using Z with parameter 
P2/P1 to form X (for DS {pp^2/Pp,i,Pf,2/pf,i))- Since X is 
a function only of Y and the new sampling, the DPI for 
Fisher applies (Theorem ISTT l (and X C Y). Furthermore, 
it is easy to see that X is statistically equivalent to the 
outcome of sampling the original data using Z with probability 

PliP2/Pl) = P2 (for DS {Pps{Pp,2/Pp,l),Pf,l{Pf,2/Pfs)) = 

iPp,2,Pf,2))- It follows that Jz(pi) > Jz(P2)- Equality holds 
iff p2 = Pi (for DS Pp.1 = pp^2 and = p/,2) implying 
X = y, since sampling inherently discards information. 

The above proof does not apply to SH as it does not 
have a closure property, meaning that X is not equivalent to 



Since this is negative, it follows that x^J^Jx decreases 
monotonically in pp, and so Jsh^(pi) < Jsh(P2) for any 
Pi > P2- The result then follows by Lemma [ZTl ii). 

B. Proof of Theorem |77| 

We first consider PPR normalization, where pp = p for both 
methods. Recall from ( [TtI i that 

w s 2 

(Jp-/).. = E ( •) l'"'-''p-"'dk,, 

k=i 



{Y.Zj{-^Y'-'-'dk. 



PS 



g 2 A; j 2 k 



Y^kLol'^'^P '^''dk,ps 



w s 2 

>E <z'"=-V-4,PS-9-'^"(Vt^-J + l)M^! 

k=j ^-^^ 

" V ' 

Ti 

where the lower bound follows by, in the second term, drop- 
ping the first i terms in the denominator and upper bounding 

'k\ 

.1 by M^!. Also, from 

W />_-,\2 

(Jp-s^.SYN).. = E ■ _ 1 l'''-''p-''dk.s.s.n 

k=] / 



Since 



tk-l\ 



> 



T2 



w 



dj.?5 

implying that Ti > T2 



> E ( _ [)P'^^ = dj,PS+SYN, 

k=l ' 
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Now under the limit p — 
term for PS becomes —{W 
whereas for PS+SYN, 



0, for j > 1, the second 
j + 1)W\ which is finite, 
— oo. It follows 



p J p J 

that (Jps+syn)jj — (Jps^)ji- Since Pp > p for PS+SYN under 
ESR, the result also holds for ESR by Theorem [H 

C. Proof of Theorem \T8\ 

We begin by examining the simplest case, where j = W. We 
have 



and 



ps+syn+seqM'w 



'^W qp rfi 

Pp Vp 



and so (Jps+syn+seq)^^^ < (Jps+seq)^^ under PPR. 

Now consider the case i < j <W — \. Let dg j and dsq.j 
be the proportion of sampled flows of size j after sampling 
by PS+SEQ and PS+SYN+SEQ respectively. For j > 1, a 
straightforward comparison would yield dg j > dsQ.j- Intu- 
itively, since more flows are discarded in the PS+SYN+SEQ 
scheme, the proportion of sampled flows must be less than the 
proportions of PS+SEQ. Therefore, expressing the diagonals 
in terms of dgj and dsQj, we have for 3 < j < W — 1, 

(Jps+seq)jj = Pp^dqj + Aq^p-^dQ^^+i + q^Pp^dQ^j+2 
and 

(Jps+syn+seq)jj = Pp^dsQ,j + qlPp^dsQ,j+i - qpP'^O] 
< Pp'^dsQ,j + q^p-'^dsQj+i, 

which, by a direct comparison, shows (Jps+syn+seq)jj 
— (Jps+SEQ)jj- Similarly, the ESR comparison follows since 
the sampling rate for PS+SYN+SEQ must increase, thereby 
reducing its CRLB. 

D. Proof of Theorem [79] 

First consider the simplest case, j = W. By substituting {Tl\ 
into ( |26] | and then differentiating w.r.t pp 



dpp'"""'"" plpD pD 

implying {i^l)ww is monotonically decreasing with pp. 
For 2 < j <W — 1 the derivative ^|^(Jos)jj is given by 



e, (D - 1)02 1 



w 



plpD pD 



pIpD 



which is negative since each term is negative for < Pp < 1. 
Finally, for = 1 we have 



:3^(Jds)ii = ^^i(1-^i)- 
dpp pD p. 



Mir'') 



For small values of pp the expression is dominated by terms 
in 1/pp and is therefore again negative, but as the first term 
is positive, for large pp it may change sign. It is not hard to 
show that > 0, so at most one sign change is 

possible. Setting pp = 1 in (l4Tl i yields (ISTT i as the necessary 
and sufficient condition for this not to occur in the feasible 
region pp < 1. The special cases follow simply from dSTT i. 

E. Proof of Theorem |20] 

First consider the case 2 < j < W. From (ISTT i and ( |24] | 

[•fsH)n ^ - p - p p^j-y-^^5)rj 

since Pf ~ p and pp = Pp{p) < p under both PPR and ESR. 

Now consider j = 1. It is convenient to recall ( l23T l and ( l2Tl i: 

fe,and(J-/)n = ^0i-^e?. 



(Jsh)iI - ^1 + ^ I]fc=2 9p ^fc, a..u,_^ps 

It follows that (Jp/)ii < (Jsh)ii when 

Pl^k=2% ^fc 
9i(l-0i)(l-p)- 



Pp 



< 



(42) 



A sufficient condition implying ( l42b is obtained by using the 
lower bound qp62 < X]fe=2 qp^^^k and rearranging, yielding 



Pp 



< 



(43) 



Oiii - ei)ii - p) + pe2 

Furthermore, since pp < p, a more restrictive sufficient 
condition is given by replacing the l.h.s. with p, which reduces 
to p{l - p)(9i{l - 9i) - 6*2^ < 0, which shows that for 
any < p < 1, if 02 > 6*1(1 - 6*1), then dUli holds and 
(Jfs^)ii — (JsH The condition is satisfied if W = 2. 

Now let be arbitrary and consider the small p (and hence 
small Pp) limit. Then ( |42] | becomes pp < p/Oi, which is 
always satisfied since pp < p. 

Appendix E 
Maximum Likelihood Estimators 

A. Proof of Theorem |22] 

The likelihood function for TV/ flows is 



f{e,Nf) = Y[fijf,9) = l[d 



j ■ 



J>0 



The MLE is the 6 which maximizes the log-likelihood 



1(9, Nf) 



j>0 



subject to the constraint J2k>i = 1, Ok > 0, Vfc. The opti- 
mization problem admits a feasible solution by the Bolzano- 
Weierstrass theorem ||29l p. 517], since the log-likelihood func- 
tion is concave and continuous, and optimization is performed 
over a compact, convex set. Furthermore, the solution obtained 
will be unique under our assumptions, since the Hessian of 
the log-likelihood is the Fisher information, which is positive 
definite given < 6'^ < 1 for all k. 

Given the assumptions, the method of Lagrange multipliers 
would yield the optimal solution since strong duality holds 
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as the problem satisfies Slater's constraint qualification ||30l 
Section 5.2.3, p. 226]. The Lagrangian is 

c{e, \,u) = Y, Mj \ogd, - a( ^ - 1) - u'^e, 

j>0 k>l 

where the vector u has elements i^k > and A £ M. By 
differentiating with respect to Ok, Vfc and the multipliers, 

k>j 

^£(0,A,i/) = l-^0fc = O, 

k>l 

—c{e,\,u) = 9k = o. 

The second equation is just the equality constraint while the 
third yields a solution 6 = Ovi', which lies on the boundary, 
yielding an unbounded solution (observed by substituting the 
solution into the likelihood function). That leaves the first 
equation, and by the Karush-Kuhn-Tucker condition, u^O = Q 
ll30l Section 5.5.3, p. 243], implying that u = Qw (our 
assumptions require that the original parameters Q < Ok < 1 
for all fc, hence the optimal must lie within the region where 
the constraints are inactive). Thus, we have, in matrix form, 

- - M' 

B^D diag(M{, . . . , M'w)lw = \lw ~ -j^bo, (44) 

recalling that D = diag((i||~^, . . . , d^). 

We proceed to solve for A using the equality constraint 
Efc>i ^fe = 1 and d = B0, as follows, by multiplying both 
sides of dHJl by 0^ to obtain 

6>^B'^D diag(M{ , . . . , M'„)lw = \-M'„ 
d^D diag(M{ , . . . , M'^)lw = A - M'^ 
l^diag(M{, . . . , M{^)lw = A - Mi 

implying X = Nf. 

For methods that pick flows in an unbiased manner, bg ~ 
qlw, and thus B'^lw = pl-w, implying (B"-^)'^!^^ — 
p~^l\Y, therefore ( l44l i reduces to 

BTDdiag(Af{, . . . , M{^)lw = {Nf - M^)!^. (45) 

which can be rewritten as 

I)-\B-Yiw - — ^— diag(A./{, . . . , M{y)lw 

Nf - AIq 

The result follows. 

Remark 32: In the conditional framework, expressing the 
log-likelihood function is difficult, due the fact that normaliza- 
tion of the likelihood involves division by random variables. 
However, the estimator above would, with high probability, 
be close to the actual MLE. The flow selection process is a 
Bernoulli process. The denominator, ^^j encapsulates 

information about Nf, because asymptotically, the deviation 
between pNf and Y^JLi extremely small, a consequence 
of the concentration of BernoulU samples around its mean. 



This property is not found amongst other methods, such as 
PS, where samples are biased towards large flows. 

B. Proof of Theorem |25] 

We begin with the optimization equation (l44l l. 

- - M' 

B^D diag(M{ , . . . , M'^)lw + -r^bo = Nflw 

"0 

Using properties of the sampling matrix for SH, we have 

D diag(A/{, . . . , M{^)lw + ^ • -ei = Nf{B-^)'^lw 

do p 

(46) 

D diag(A/^ + Mi . . . , M{y)lw = Nf{B-Ylw 

(47) 

I)diagip{M[, + M[),...,M{^)lw =Nflw (48) 

where in (|46] | we use the property bj = ^B^ei, where is 
the canonical basis vector, in (l47b we use do = ^di, and in 

(gUl we use (B-'^)^lw = diag(p-\ 1 . . . , l)lw- All these 
properties can be obtained by a straightforward evaluation 
using the sampling matrix. The final line reduces to 

B9 = l--[p{M;, + M[),...,Mi^f, 

proving the result. 

The estimator is unbiased, as by taking the expectation, we 
obtain Eb(A/^ + 7\/{)] = Nf{pdo+pdi) = Nf{p + q)di ^ di, 
by using the identity di = while clearly E[A'/j] = dj for 

all j > 2. Thus E[0sh] = Bsj^^Bsh^ = 6. 

References 

[1] N. Duffield, C. Lund, and M. Thorup, "Estimating Flow Distributions 

from Sampled Flow Statistics," IEEE/ACM Trans. Networking, vol. 13, 

no. 5, pp. 933-946, Oct 2005. 
[2] N. Hohn and D. Veitch, "Inverting Sampled Traffic," in Proc. 2003 ACM 

SIGCOMM Internet Measurement Conference, Miami, October 2003, 

pp. 222-233. 

[3] , "Inverting Sampled Traffic," IEEE/ACM Transactions on Network- 
ing, vol. 14, no. 1, pp. 68-80, 2006. 

[4] B. Ribeiro, D. Towsley, T. Ye, and J. Bolot, "Fisher information 
on sampled packets: an application to flow size estimation," in 
Proc. ACM/SIGCOMM Internet Measurement Conf., Rio de Janeiro, Oct 
2006, pp. 15-26. 

[5] G. Varghese, Network Algorithmics. San Francicso: Elsevier/Morgan 
Kaufmann, 2005. 

[6] C. Estan and G. Varghese, "New directions in traffic measurement and 

accounting," ACM Transactions on Computer Systems, vol. 21, no. 3, 

pp. 270-313, August 2003. 
[7] R Tune and D. Veitch, "Towards Optimal Sampling for Flow Size 

Estimation," in Proc. ACM SIGCOMM Internet Measurement Conf., 

VouUagmeni, Greece, Oct.20-22 2008, pp. 243-256. 
[8] L. Yang and G. Michailidis, "Estimation of flow lengths from sampled 

traffic," in Proc GLOBECOM, San Francisco, CA, Nov 2006. 
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. 

John Wiley and Sons, Inc., 2006. 
[10] A. Hero, J. Fessler, and M. Usman, "Exploring estimator bias-variance 

tradeoffs using the Uniform CR bound," IEEE Trans. Sig. Proc, vol. 44, 

no. 8, pp. 2026-2041, Aug 1996. 
[1 1] J. D. Gorman and A. O. Hero, "Lower bounds for parametric estimation 

with constraints," IEEE Trans. Info. Th., vol. 26, no. 6, pp. 1285-1301, 

Nov 1990. 

[12] D. Harville, Matrix Algebra from a Statistician's Perspective. Springer- 
Verlag, 1997. 

[13] J. E. Strum, "Binomial matrices," The Two Year College Mathematics 
Journal, vol. 8, no. 5, pp. 260-266, November 1977. 



25 



[14] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed., ser. 
Springer Texts in Statistics. Springer, 1998. 

[15] P. Tune and D. Veitch, "Fisher information in flow size distribution es- 
timation: Technical report," Dept. E&EE, The University of Melbourne, 
Tech. Rep., 2008, copy available upon request. 

[16] D. Shah, S. Iyer, B. Prabhakar, and N. McKeown, "Maintaining statistics 
counters in line cards," IEEE Micro, vol. 22, no. 1, pp. 76-81, 2002. 

[17] S. Ramabhadran and G. Varghese, "Efficient implementation of a statis- 
tics counter architecture," ACM SIGMETRICS Performance Evaluation 
Review, vol. 31, no. 1, pp. 261-271, June 2003. 

[18] Q. Zhao, J. Xu, and Z. Liu, "Design of a novel statistics counter 
architecture with optimal space and time efficiency," Prof, of ACM 
SIGMETRICS 2006, vol. 34, no. 1, pp. 323-334, June 2006. 

[19] R. T. Rockafellar, Convex Analysis, ser. Princeton Landmarks in Math- 
ematics and Physics. Princeton University Press, 1970. 

[20] C. Estan, K. Keyes, D. Moore, and G. Varghese, "Building a better 
netflow," in Proc. ACM SIGCOMM 2004. Portland, OR, Aug" 2004. 

[21] NLANR, Leipzig-ll Trace Data, http://pma.nlanr.net/Special/leip2.html. 

[22] , Abilene-III Trace Data, http://pma.nlanr.net/Special/ipls3.html. 

[23] CoralReef, http://www.caida.org/tools/measurement/coraheef/. 

[24] S. Kay, Fundamentals of Statistical Signal Processing, Volume I, Esti- 
mation Theory. Prentice Hall, 1993. 

[25] M. Mitzenmacher and E. Upfal, Probability and Computing. Cambridge 
University Press, 2005. 

[26] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 
2nd ed. Wiley Interscience, 2008. 

[27] F. Zhang, Matrix Theory: Basic Results and Techniques. Springer- 
Veriag, 1999. 

[28] R. Zamir, "A proof of the Fisher information inequality via a data pro- 
cessing argument," IEEE Transactions on Information Theory, vol. 44, 
no. 3, pp. 1246-1250, May 1998. 

[29] A. E. Taylor and W. R. Mann, Advanced Calculus, 3rd ed. John Wiley 
and Sons, 1983. 

[30] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge 
University Press, 2004. 




sparse signal recovery. 



Paul "nine (M'09) received the B.E. (Hon.) and 
B.Sc. degrees in Electrical and Electronics Engi- 
neering and Computer Science respectively, both 
in 2005, from the University of Melbourne, Aus- 
tralia. He has recently completed a Ph.D. degree 
at the ARC Special Center for Ultra-Broadband 
Information Networks (CUBIN) at the University 
of Melbourne, Australia. His research interests 
are in communication networks, particularly in 
traffic inversion and sampling, information the- 
ory, statistics and signal processing, specifically 



Darryl Veitch (F'lO) completed a BSc. Hons, at 
Monash University, Australia (1985) and a math- 
ematics Ph.D. from Cambridge, England (1990). 
He worked at TRL (Telstra, Melbourne), CNET 
(France Telecom, Paris), KTH (Stockholm), IN- 
RIA (Sophia Antipolls, France), Bellcore (New 
Jersey), RMIT (Melbourne) and EMUlab and 
CUBIN at The University of Melbourne, where 
he is a Professorial Research Fellow. His research 
interests include traffic modelling, parameter es- 
timation, active measurement, traffic sampling, 
and clock synchronization. 




PS 

o PS+Syn 

>^ PS+Seq 
^ DSpf>pp 

PS+Syn+Seq 

DSpf<pp 

. SH 
PS 




Flow Size k 



10 



^^ "K- V ^^ . 



